PREFACE


The 1979 International Conference on Parallel 
Processing is the eighth of a series of annual 
meetings initiated in 1972 at Sagamore, N. Y. 

A tradition has developed characterizing this 
conference: the papers presented are heavily 
oriented towards research topics but with a very 
pragmatic flavor. Also, the remoteness of the
meeting location and the informal atmosphere have
fostered the spirit of exchange among participants.


We hope to encourage the continuation and 
expansion of this excellent tradition. This year 
we received a total of 93 papers of which 23 came 
from 10 countries other than the U. S. despite 
the increasing number of emerging conferences on 
closely related topics. In an effort to better 
serve the participants we have planned to have 
these Proceedings printed and available at the 
time of the conference. You will not, therefore, 
find the awards for most original paper and best 
presentation announced in the following pages, 
but the selection will take place at the conference
as usual. For this same reason session chairmen
are not acknowledged at this time. The new
publication schedule has unquestionably imposed
tighter time constraints than ever before on
authors and reviewers alike. To them should go
the credit for the accomplishment and my heartfelt
appreciation for their willing cooperation.

The present program reflects some new and
broader trends towards concurrent computing. It
is illuminating to analyze the changing interests
of the authors. My assessment is that the emphasis
has shifted from hardware organization
features to more conceptual language translator
and algorithmic topics. More attention is given
to synchronization and control issues in the
languages and models of parallel architectures, as
seen in the first two sessions. I consider
this a very natural and healthy development.
Also, the topics of searching and reconfigurable
systems showed more strength than in the past
when compared to better established subjects such
as performance evaluation, parallel arithmetic
and pipelining. The session on networks for
interconnection promises to be one of the strongest
ever; some fascinating concepts are presented
on array processors and there are novel results
on special purpose multiprocessor architectures.

Tse Feng comes first in my list of acknowledgments
for the organization of this conference
as General Chairman, particularly when I consider
his many other present commitments. He and his
assistants handled the arrangements, publicity,
and tentative program printing and distribution.
Next, Annette Krygiel has planned a panel session
in which practitioners will focus on the most urgent
problems facing parallel computing. A formal
program committee was not appointed and I believe
this contributed to both the spontaneity and
heterogeneous nature of the papers received. It
did make the task of the Program Chairman more
involved and I would recommend such a committee
for the future on that basis. The difficulties
were ameliorated by a number of reviewers who
contributed well beyond the call of duty in a
variety of circumstances. From the list of 157
reviewers recognized later in these Proceedings,
I would like to particularly thank Tilak Agerwala,
Jean-Loup Baer, Bruce Berra, Dave Davis, Mario
Gonzalez, Robert Keller, Willis King, Jack
Lipovski, Mike Liu, Nancy McDonald, Ken Thurber,
Kishor Trivedi, and Dave VanVoorhis, among others
in that category. Also, Mariagiovanna Sami and
Chris Vissers enthusiastically publicized our
call for papers in Europe and to them we express
our gratitude.

The presentation by our keynote speaker, Dr.
Paul Schneck, which is appropriately the only unrefereed
contribution to these Proceedings, sets
the pace for the papers that follow. We appreciate
his willingness to accommodate our schedule
and to share his views with us.

Last but foremost, I want to acknowledge the
assistance of the Secretary of our Computer
Science Program at the University of South Florida,
Mrs. Brenda Malowney, who was responsible for
organizing, addressing and copying more than
600 pieces of mail involving authors, reviewers
and other participants.

We hope that the labors of those involved 
may be repaid by your enjoyment of the conference. 
ISSUES IN PARALLEL COMPUTING:
A NON-EUCLIDEAN EXAMINATION 


Paul B. Schneck 
Office of the Director 
NASA/Goddard Space Flight Center 
Greenbelt, MD 20771 


Abstract -- In this talk I will review some 
of the identifiable milestones of the evolution 
of computers toward what we currently term 
parallel systems. A significant observation is 
that parallelism has been present in computing 
systems from the beginning, but that programmers 
did not have to deal with it until recently. 

One might even conclude that the difficulties 
currently associated with parallel computers 
originate with software and programming and not 
exclusively with the innovative architectures of 
those machines. We will discuss this issue and 
explore potential approaches to a solution. 


Introduction 


I am going to identify some of the broader
underlying issues relating to parallel processing. 
Only after we have identified the appropriate 
issues can we begin to make substantial progress 
toward their resolution. 


As a starting point, let us note that 55 
papers will be presented this year, an increase of 
more than 25 percent over the 42 papers presented 
last year. The magnitude of our parallel processing
vector has clearly grown. What about its
orientation? To understand the orientation of
the conference we will classify the papers (somewhat
arbitrarily) into one of three possible
subject areas: hardware, software, and algorithms. 


We begin by briefly looking at the group of 
papers classified as hardware related. The 
thrust in this area has shifted somewhat from 
geometric considerations to those dealing with 
concurrency. The lock-step model for parallel 
processing has begun to fade. This year there 
are sessions devoted to synchronization and to 
serial-by-bit arithmetic. What is the underlying 
theme? 


In the area of software, the number of 
papers has declined, even as the conference has 
grown. Where this stems from a reduction to 
practice of state-of-the-art software, the field 
has grown. But where this is a result of our 
inability to cope with facing the problems of 
parallel computing, we may be in trouble. The 
traditional software work in languages continues. 
In fact, there is already talk of standardizing 
the language for array processing. The decreased 
emphasis in the software area may result from a 
lack of either "push" or "pull." A "pull" comes 
about in response to requirements for improvements 
which can be brought about by software. Surely 
there is no lack here. A "push" develops when 
new algorithms are available for implementation. 
It appears that this is the bottleneck. 


If there has been a bottleneck due to the 
lack of availability of algorithms for parallel 
processing, then this conference brings reason 
to expect an end to that situation. The growth 
from last year to this year occurs in the area 
of algorithms. 


There seems to be a recognition that the
mere existence of a problem's solution, perhaps
demonstrated by software, does not result in a
practical application of parallel processing.
If this is the case, then we may look forward to
renewed vigorous efforts in software, based on a
sound algorithmic foundation.


Parallel Computer Development: The Process


In this section we will examine the process 
leading to the development of parallel computers 
with an eye toward overcoming any apparent 
deficiencies. 


The Traditional Approach. The traditional
approach to development of a parallel processing
capability is shown in figure 1. We note that
the engineering inspiration which underlies all
of the following activities may be decoupled from
the discipline activities for which the system
will be used.
FIG. 1. THE TRADITIONAL APPROACH TO
PARALLEL PROCESSING


What does the engineer use as a basis for 
his design? While there is no definitive answer 
to this question, there does seem to be a reason- 
able response: Lacking other inputs, the engineer 
attempts to optimize the performance of the 
computer (and here we are referring only to the 
mainframe) in terms of potential results per unit 
of hardware. It is this phenomenon that was 
behind the slow acceptance of floating point and 
its eventual incorporation into hardware. After 
all, the existence of separate execution facilities
capable of handling fixed point and floating point
arithmetic means that when one facility is in use, the
other is idle--perhaps a convenience, but not
an optimization. The quest for component utilization
is so deeply embedded that the requirement
for general availability of floating point 
stimulated two departures in architecture. 


In the first instance, typified by the CDC 
6600 and IBM 360/91, additional control circuitry 
was added to the instruction unit of the computer 
so that both the floating point and fixed point 
execution units could perform concurrently. In 
the second instance, which developed at about the 
same time, microprogramming was implemented as a 
means of control of a processor. (The IBM 360/30 
and 360/40 are excellent examples of this.) This 
permitted a single hardware unit, i.e., a serial- 
by-byte adder or multiplier, to be used for 
either floating point or fixed point operations, 
as appropriate.


By analogy, the introduction of parallel 
computers (or vector computers) was merely 
responding to the same thrust to make optimum 
use of hardware in a design. There was no 
coupling with a discipline area. Thus, it is 
not surprising that the utility of these first 
machines has been questioned, that their accep- 
tance has been slow in coming. 


The Modified Approach. Because of difficul- 
ties experienced with these initial design efforts 
we have adopted a modified approach to the use of 
parallel machines. This modified approach is 
depicted in figure 2. We note that the basic 
approach to hardware design remained unchanged. 
What has changed is the earlier interaction of 
individuals skilled in the discipline areas which 
will utilize the machine. This earlier involvement
results in new techniques which are responsive
to the special abilities (as well as the
relative inabilities) of parallel processors.


I will cite a particular example to demon- 
strate the importance of algorithmic interaction 
at this point in the process. In the solution 
of the finite difference approximation to the par- 
tial differential equation representing the heat 
flow in a solid, the Gauss-Seidel method, or 
method of successive displacement converges twice 
as fast as the Jacobi method, or method of 
simultaneous displacement. The numerical analysis 
of the two methods reveals that the eigenvalues 
of the former method are the square of the eigenvalues
of the latter method. Thus, two
iterations of the latter method are necessary for
each iteration of the former method. Now, the
Fortran representation of the successive displacement
scheme appears something like:


      DO 1 I = 2,N-1
      DO 1 J = 2,N-1
    1 A(I,J)=.25*(A(I-1,J)+A(I+1,J)+A(I,J-1)+A(I,J+1))


While the Fortran representation of the simultane- 
ous displacement scheme appears as: 

      DO 1 I = 2,N-1
      DO 1 J = 2,N-1
    1 B(I,J)=.25*(A(I-1,J)+A(I+1,J)+A(I,J-1)+A(I,J+1))

      DO 2 I = 2,N-1
      DO 2 J = 2,N-1
    2 A(I,J) = B(I,J)


Clearly the method of successive displacement is
not only faster, but also less cumbersome, easier to
read, and occupies only half as much space for
data as the method of simultaneous
displacement. Naturally, we almost exclusively
see the method of successive displacement
implemented for conventional, sequential machines.
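
The factor of two between the methods can be observed directly. The following is a minimal Python sketch, not the author's code: the grid size, boundary values, tolerance, and all function names are assumptions chosen for illustration. It counts the sweeps each scheme needs on a small heat-flow grid whose top edge is held at 1.0.

```python
def make_grid(n):
    """Return an n-by-n grid: top edge held at 1.0, everything else 0.0."""
    grid = [[0.0] * n for _ in range(n)]
    grid[0] = [1.0] * n
    return grid


def simultaneous_sweep(a):
    """One Jacobi sweep: every new value is computed from the old grid."""
    n = len(a)
    b = [row[:] for row in a]          # second array, as in the B(I,J) version
    change = 0.0
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            b[i][j] = 0.25 * (a[i - 1][j] + a[i + 1][j] + a[i][j - 1] + a[i][j + 1])
            change = max(change, abs(b[i][j] - a[i][j]))
    return b, change


def successive_sweep(a):
    """One Gauss-Seidel sweep: new values are used as soon as they exist."""
    n = len(a)
    change = 0.0
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            new = 0.25 * (a[i - 1][j] + a[i + 1][j] + a[i][j - 1] + a[i][j + 1])
            change = max(change, abs(new - a[i][j]))
            a[i][j] = new              # updated in place, no second array
    return a, change


def sweeps_to_converge(sweep, n=12, tol=1e-6, limit=100000):
    """Apply sweep until the largest change in one sweep falls below tol."""
    a = make_grid(n)
    for count in range(1, limit + 1):
        a, change = sweep(a)
        if change < tol:
            return count, a
    raise RuntimeError("did not converge within the sweep limit")


if __name__ == "__main__":
    jacobi_count, _ = sweeps_to_converge(simultaneous_sweep)
    seidel_count, _ = sweeps_to_converge(successive_sweep)
    print("simultaneous:", jacobi_count, "successive:", seidel_count)
```

The stopping rule used here (largest change per sweep) is only one possible convergence test, so the measured ratio will be near two rather than exactly two.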
FIG. 2. MODIFIED TRADITIONAL APPROACH
TO PARALLEL PROCESSING


A New Approach. What, if anything, does 
this have to do with parallelism? The answer is 
that when we aim to solve a heat flow problem on 
a parallel machine we need to reexamine the way 
in which the hardware will perform the Fortran 
statements. In the classical case of a lock-step 
parallel processor (e.g., the ILLIAC-IV) the
method of simultaneous displacement is the natural 
mode of operation. In order to implement the 
method of successive displacement it would be 
necessary to operate with only one processing 
element at a time, defeating the potential advan- 
tage of a parallel processor. 


To summarize, the method of simultaneous
displacement uses all of the processing elements
(e.g., 64 for the ILLIAC-IV) and requires twice as
many iterations as the method of successive
displacement, which results in a net advantage of
a factor of 32, in this example, for the "inferior" algorithm.


Let us now consider how we find the maximum 
value of a set. With a sequential machine we 
merely step through the entire set, retaining 
the maximum, until we reach the end. At that 
time the value retained is the maximum for the 
set. This is depicted in Fortran as: 


      AMX = A(1)
      DO 1 I = 2,N
    1 AMX = AMAX1(AMX, A(I))


On a parallel machine we want to take 
advantage of the simultaneous availability of 
computing resources. This is a crucial point, 
so I will clarify the intent of that statement. 
We do not wish merely to maximally utilize the 
processor's resources. We do wish to use the 
resources to reduce problem solution time. 


When dealing in these circumstances we are 
no longer interested in the strict computational 
complexity of an algorithm. It may be preferable 
to perform more operations to solve a problem 
and yet obtain a faster solution. 


In obtaining the maximum of a set of N
elements on a parallel machine, many of the
processing elements will remain idle during the
log₂ N steps necessary.
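
A pairwise (tree) reduction is one common way to reach that step count. The sketch below is an illustration in Python rather than anything taken from the talk; the function name and the returned "busy" list are assumptions. Each pass of the loop stands for one parallel step, and the list records how many comparisons, and hence how many processing elements, are active in that step.

```python
def tree_max(values):
    """Return (maximum, busy), where busy[k] is the number of comparisons
    that would run concurrently in step k of a pairwise reduction."""
    level = list(values)
    if not level:
        raise ValueError("an empty set has no maximum")
    busy = []
    while len(level) > 1:
        # Compare neighbours two at a time; each comparison is independent.
        pairs = [max(level[k], level[k + 1]) for k in range(0, len(level) - 1, 2)]
        busy.append(len(pairs))
        if len(level) % 2:             # odd element out is carried forward
            pairs.append(level[-1])
        level = pairs
    return level[0], busy
```

For 64 elements the busy counts are 32, 16, 8, 4, 2, 1: six steps instead of the 63 sequential ones, with fewer elements at work in every step. The total number of comparisons is still N-1, so in this case the speed comes from concurrency alone.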


Thus, we arrive at the process depicted in 
figure 3 as a recommendation for a system 
approach to problem solving. 


The Role of Software 


Following are some general observations 
about software. They are particularly pertinent 
to all of us, as practitioners of parallel 
processing: 


1. Computer instruction sets typically 
have 100-200 instructions. 

2. The majority of programs are written 
in compiler languages (e.g. Fortran, COBOL, 
Ada). 

3. Compilers usually generate only 50-60 
different instructions.


During a project on compiler portability, 
I transferred a CDC 6600 compiler for "LITTLE"
to an IBM 360. This was accomplished in a
straightforward manner by transliterating the 
instructions generated for the CDC 6600 to 
their counterparts for the IBM 360. There are 
a few instructions of course, e.g. population 
count and pack, which do not have direct counter- 
parts. These required special treatment but do 
not conflict with the above observations. 
Based on these observations, one can infer
that: 


1. The full range of instructions available 
on present computers is not being utilized. 

2. Broadening the semantic range of pro- 
gramming languages might result in significant 
improvements in ease of programming, speed, and 
space utilization. 


This view of the state of programming 
languages and compilers is depicted in figure 4. 
The key issue is brought to the fore with this 
figure. We do not know, indeed, we cannot deter- 
mine, the performance of the software that is 
the channel between the discipline activity and 
the computer system. 
We find ourselves without a metric. At
best it is difficult to make progress in this 
mode. We cannot easily determine whether we 
are approaching our target or withdrawing from 
it. We do not know how far we have to go to 
reach our goal. In the case of building circuits 
to perform binary addition there was a continual 
refinement and improvement of performance. 
Relatively late in the era of building such cir- 
cuits a metric was devised. Only then did we 
know for certain that we were truly near our 
target. 


A similar result for matrix multiplication 
removed the previous barrier to performance 
which had long stood unchallenged. Recent results 
have lowered the number of multiply operations 
a second time. 


A Spectrum of Parallel Processors 


There has been steady progress across the 
spectrum of computer organizations from initial, 
fully sequential processors to parallel pro- 
cessors. While this has occurred the structure 
of programming languages has not kept pace. 


In figure 5 we see that the first steps in
this progression were not visible to the
programmer. In fact, they were almost invisible,
even at the instruction set level. Except for
differences in timing there were no functional
changes caused by early parallelism. Compilers
with instruction schedulers made it possible
for programmers to completely ignore these new
capabilities. Later steps in this direction
have radically changed the instruction set
capabilities and induced some changes in programming
languages. This is a specific case of the
situation to which we referred earlier.


As a computer user, I can measure a system's 
effectiveness only in broad terms. Programming 
languages and compilers must now advance so that 
we can exploit the advantages of parallel archi- 
tectures. 
FIGURE 5. SPECTRUM OF PARALLELISM


Conclusion 


We need to address the problems of concep- 
tion, design, and software for parallel processor 
systems in a new fashion. We must look at all 
the elements of this assemblage as a system 
requiring optimization. If we address any indi- 
vidual component we may achieve only a local 
maximum and forego an opportunity to improve the 
system. 


I will conclude by providing questions, not
answers, to this audience.


Why does current software frequently blur 
the distinction between parallel processors and 
vector processors? 


What is the critical element in parallel 
processing: data structure, execution scheduling, 
algorithm, etc.? 


Can we develop a metric with which we can 
judge the relative advantages and disadvantages 
of alternatives in hardware, software, and 
algorithms? 


What are the primitives on which concepts of 
parallelism are based? 


The dialogue generated by this conference 
will lead toward resolution of those issues. 


HIGH-SPEED MULTIPROCESSORS AND THEIR COMPILERS 


D. J. Kuck and D. A. Padua 
Department of Computer Science 
University of Illinois at Urbana-Champaign 
Urbana, Illinois 61801 


Abstract -- High speed multiprocessors are 
seen as a means of speeding up a wide class of 
computations that are not amenable to array 
processing. We discuss the structure of such 
machines and compare them to other organizations. 
The key to their efficient use is good compiler 
algorithms, and we present several approaches to 
compilation. 


1. Computer Structures 
Introduction 


Parallelism in computer systems arises for 
two different reasons. One is to increase the 
speed of execution of a single program, and the 
other is to increase the throughput of a multi- 
programmed system. Among existing computers, 
most parallel and pipeline array processors have 
been designed with the first reason in mind and 
most multiprocessors with the second. 


Most present multiprocessors do not seem to 
consider in their architecture any features aimed 
specifically at the speedup of a single program, 
except for the fact that they have a memory which 
can be accessed by every processor. This is obvi- 
ously good enough for multiprogramming or for 
speeding up programs capable of decomposition 
into processes requiring a low frequency of inter- 
action. However, it is clear that by allowing a 
higher frequency of interaction, not only could 
more speedup be obtained, but also the class of 
programs capable of speedup would increase notice- 
ably. 


In this paper we address the question of 
whether additional architectural features could 
enhance the performance of a multiprocessor with 
respect to the speed of single programs. The 
best way to proceed seems to be to consider struc- 
tures frequently found in programs and to evaluate 
the impact of different designs upon their execu- 
tion time. In contrast to the language-directed 
approach to machine design, we advocate the
program-directed approach. We have measured
FORTRAN program characteristics, with the goal of
discovering the limitations of array machine
structures, for some time [KuMC72], [KBCD74]. The
same methods have been used for COBOL [Stre74],
GPSS [Davi72], and SNOBOL [Chen77], some of which
led to multiprocessors, as well. We are now
making measurements of programs that are not suit- 
able for array processing, with the goal of 
designing a high-speed multiprocessor. Such a 
machine could be used to speed up a single program 
or to enhance throughput by multiprogramming. 


1.1 A Machine Survey 


In [Kuck78, Sect. 4.2.5], a control unit 
taxonomy was presented that is similar to that of 
[Flyn72] but can be extended to more classes of 
machines; 16 categories were mentioned. The idea 
is to view a control unit as accepting one or more 
instruction sequences and generating execution 
sequences for the rest of the machine. For the 
moment, we will be concerned only with execution 
sequences and will restrict our attention to four 
kinds of machines: 


SES: single execution, scalar;
MES: multiple execution, scalar;
SEA: single execution, array; and
MEA: multiple execution, array.


An SES machine is a traditional serial com- 
puter. Many SEA machines have been built, 
including pipeline and parallel processors for 
which single instructions generate an array of 
executable operations. In fact, pipeline array 
processors attached to minicomputer hosts are now 
in widespread use. A number of array machines 
(including the CRAY-1 and TI ASC) can execute 
several array operations simultaneously and may be 
regarded as MEA machines. The MES category 
includes several types of machines that can exe- 
cute multiple scalar operations at once; for 
example, the CDC 6600 or various multiprocessors. 
Clearer distinctions between machine types can be 
made by also considering instruction sequences (as 
mentioned above), but that is beyond our present
scope. Any SEA, MEA or MES machine may be
regarded as a multioperation machine, because sev- 
eral operations are being carried out at once. 
However, we are concerned with high-speed execu- 
tion of single programs, and will consider the 
shortcomings of traditional array processors 
together with how a kind of multiprocessor can be 
used to improve the situation. 


Historically, when SES machines were seen to 
be too slow and technology was not speeding up 
fast enough, architectural innovations were used 
to achieve faster turnaround. For example, the 
CDC 6600 allowed several operations to be carried 
out at once and the 360/91 added pipelining to
this idea. A combination of compiler software, 
control unit hardware, and processor hardware was 
used to attain speed improvements over SES ma- 
chines. These machines (and their successors) 
may be called "data-flow" machines because data 
and control dependences are used to determine when 
operations can be executed, so some operation 
overlap is possible. Research in this area con- 
tinues [DeMi74], [KeLP79], [Davi79]. On the other 
hand, SEA machines generally rely on a user or 
software vectorizer to introduce explicit array 
statements in a source program, and these are then 
executed at high speed. It is sometimes difficult 
for users to rethink and rewrite their programs 
properly. Automatic vectorizers, although they 
can be much more powerful than hand reprogramming, 
have not been commercially available until re- 
cently. 


It may be observed that certain algorithms
are amenable to substantial speedup by a combination
of these ideas. Examples include the merging and
sorting networks of Batcher [Batc68], the FFT network
of Pease [Peas68], the arithmetic expression
tree evaluation of Kuck and Muraoka [Swan72],
[VaZi78], etc.


Machines in which the completion of one oper- 
ation triggers the next can become complex and 
cumbersome when one attempts to push the idea too 
far. Array machines and vectorizers are the fast- 
est available systems today, but it can be shown 
abstractly and empirically, that they are ineffi- 
cient for certain classes of computations. What 
then is the proper next step toward "ultimate 
speed" machines that are useful in a wide range of 
applications? 


We believe that since a fairly wide class of 
computations can be successfully vectorized for 
array machines [KuMC72], [KBCD74], [Kuck77], 
[CKTB79], one should take this as given and study 
the difficulties with the remaining computations. 
One should look for additional compiler algorithms 
and hardware flexibilities that lead to substan- 
tial speedups on a much wider class of computa- 
tions. Thus we are led to a class of machines 
that can behave as a high-speed array processor 
when appropriate and can behave as a high-speed 
multiprocessor when necessary. The data and con- 
trol flow notions of earlier machines must, of 
course, be exploited by such a system in an effi- 
cient manner. 


1.2 A High-Speed Multiprocessor 


The following sketch of a high-speed multi- 
processor is preliminary. We have studied a num- 
ber of algorithms and programs and believe this 
offers a substantial improvement over current ma- 
chines. Ideas about compiling for this machine 
will be presented in later sections. We are in
the process of implementing these and, after analyzing
our collection of over 1,000 programs, we
expect the architecture to become clearer.


First, we describe a processor cluster (PC) 
that can behave as a SIASEA (single instruction, 
array; single execution, array) machine or as a 
MISMES (multiple instruction, scalar; multiple 
execution, scalar) machine. Each processor in the 
cluster is a fairly traditional machine, with a 
scalar control unit and processor, together with a 
local memory. Thus, each processor may carry out 
an independent computation. The processor capa- 
bilities can be chosen to meet the intended appli- 
cation areas, but the use of LSI processors is 
clearly attractive (e.g., 32-bit floating-point 
microprocessors). The cluster size is also a 
design variable, but from the viewpoints of both 
technology and applications effectiveness, 8 to 
16 processors per cluster seems appropriate. 


Fig. 1 shows two such processor clusters 
(containing c processors each) interconnected via 
a set of local alignment networks (LANs). These 
alignment networks can be used to communicate 
within one cluster (each can use half of the LAN 
independently) or adjacent clusters can inter- 
communicate through the shared LAN. Each cluster 
also has an array control unit that accepts an 
array instruction set and drives all c processors 
in lock-step fashion. 


The alignment networks can be used to commu- 
nicate data between processors and memories. They 
may also be used, simultaneously, to send status 
bits between processors. One important feature of 
each processor control unit is that the execution 
of each instruction is conditional on a set of 
status bits. These can be set by other processors 
in the cluster (or outside the cluster). Thus, a 
set of computations running in a PC can be made 
dependent at the instruction level--so, for ex- 
ample, data computed in processor 1 can be stored 
in a local memory, processor 1 can set an appro- 
priate bit in processor 2, and processor 2 can 
fetch the result of processor 1 to proceed with 
its computation. All of this can be accomplished 
in a matter of a few clocks, so very tight cou- 
pling is possible. For example, if each processor 
is computing one iteration of a loop and there are 
data dependences from one statement to the next 
between iterations, these can be quickly satisfied.
As another example, a PC could evaluate an
arithmetic expression tree quickly by appropriate 
alignments, see [Swan72]. We shall discuss the 
use of such a PC in more detail later with respect 
to program structures. For the moment it is clear 
that a PC can operate as an MES or SEA machine. 
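
The store, signal, fetch sequence described above can be mimicked in software. The following Python sketch is only an analogy: it uses operating-system threads and `threading.Event` in place of hardware status bits, so it shows the ordering constraint but none of the "few clocks" timing. The names `run_pair`, `local_memory`, and `status_bit` are invented for the illustration.

```python
import threading


def run_pair():
    """Processor 1 stores a result and sets a status bit; processor 2's
    next step is conditional on that bit before it fetches the result."""
    local_memory = {}                  # stands in for processor 1's local memory
    status_bit = threading.Event()     # stands in for a status bit in processor 2
    results = []

    def processor_1():
        local_memory["x"] = 6 * 7      # compute and store the datum
        status_bit.set()               # then set the bit in processor 2

    def processor_2():
        status_bit.wait()              # execution is conditional on the bit
        results.append(local_memory["x"] + 1)   # fetch and proceed

    consumer = threading.Thread(target=processor_2)
    producer = threading.Thread(target=processor_1)
    consumer.start()                   # started first, so it really has to wait
    producer.start()
    producer.join()
    consumer.join()
    return results[0]
```

Because the store precedes the signal and the fetch follows the wait, processor 2 always sees the completed value, whichever thread happens to be scheduled first.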


Next, consider a collection of PCs to form a
complete system. Fig. 2 shows a collection of n
PCs plus a set of global control units (GCUs), a 
global alignment network (GAN), and a global mem- 
ory (slower levels in the memory hierarchy are 
ignored here). The GCUs allow a collection of 
PCs to work together either in an array or scalar 
fashion. Thus, the entire system could be parti- 
tioned to handle several jobs at once, but the key 
point is that any job may be handled in either an 
SEA, MEA or MES mode. Program memory is associ- 
ated with each control unit. 


Array access and alignment from the global 
memory through the global alignment network is 
well understood [BuKu71], [KuSt79], [LaVo79], 
[Lawr75], and the same ideas can be used in each
PC for array operations. Also, memory access for 
independently distributed addresses is well under- 
stood [ChKL77]. Thus, a set of array computations 
with linear addressing patterns can be assumed to 
work well. Memory stores and fetches for MES 
operation as well as most subscripted subscript 
array accesses should work well also. The problem 
of aligning data in these latter cases has been 
unsolved until recently. We will discuss this in 
more detail in Section 4. 


2. Language/Machine Relations 


The flows of data and control in programs are
fundamental considerations in the design and per- 
formance of a machine to execute those programs. 
We are interested in relating various program con- 
structs to various computer organizations. In 
order to proceed, we shall first give a broad 
paradigm for the languages and programs we are 
considering. Then we will relate these to the 
machine organizations discussed earlier. 


The language paradigm emphasizes those as- 
pects of programs that most concern us in com- 
piling programs and executing them with high 
performance on the given machine structures. Many 
details are omitted in the interest of brevity. 
The presentation contains three parts: 


1) π-blocks;
2) DAGs of π-block clusters; and
3) Control structures for (1) and (2).


2.1 π-Blocks


The name "π-block" means a block that is derived
from a partition of a program's data and
control dependence graph. The discussion here is 
a generalization of that in [Kuck78], which was 
given in a narrower language setting. 


We shall consider π-blocks as the smallest
objects of concern to a compiler in scheduling a
computation on a machine. It is assumed that
atomic operations exist in any given machine, and
that dependence graphs of such atomic operations
are contained within each π-block. Several examples
of π-blocks follow.


An arithmetic assignment statement is a π-block
with atomic arithmetic operations connected
by a data dependence tree, as well as an atomic
assignment operator. Similar statements can be
made about Boolean assignment statements or character
string assignment statements. Also,
cons(car(x),cdr(y)) might be a LISP π-block, or a
complex APL expression might be a π-block. In
programs for a sorting or merging network machine
[Batc68], the comparison of two numbers and transmission
of the greater in one direction and the lesser
in the other direction can be regarded as a
π-block. A similar definition could be made with
respect to an FFT network [Peas68], etc. Other
examples of π-blocks are decision trees that
branch to one of several program locations, and
conditional expressions that may do a branch or
select one sequence of several assignment statements
[Kuhn79].


Thus, π-blocks may have as values data items
of any type, or program addresses. They can also
be subject to mode bits that determine whether or
not particular values are to be computed.


Programs are collections of π-blocks of the
types mentioned above. In order to be able to
deal with acyclic graphs of objects later (in
scheduling and in the hardware), we define a
π-block to be a single node of the type mentioned
above, or a maximal cycle of such nodes formed by
data and control dependences.
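
A block defined this way (a single node, or a maximal cycle of nodes) is what graph theory calls a strongly connected component of the dependence graph. The following is a minimal sketch, assuming statements are given as names and dependences as (source, sink) pairs; the function name `pi_blocks` and the four-statement graph are illustrative assumptions, and Kosaraju's two-pass method is used here only as one well-known way to find such components.

```python
from collections import defaultdict

def pi_blocks(nodes, arcs):
    """Group statements into blocks: the strongly connected components
    of the dependence graph (Kosaraju's two-pass algorithm)."""
    succ, pred = defaultdict(list), defaultdict(list)
    for src, dst in arcs:
        succ[src].append(dst)
        pred[dst].append(src)

    # Pass 1: record nodes in order of DFS completion on the forward graph.
    seen, order = set(), []
    def visit(v):
        seen.add(v)
        for w in succ[v]:
            if w not in seen:
                visit(w)
        order.append(v)
    for v in nodes:
        if v not in seen:
            visit(v)

    # Pass 2: sweep the reversed graph in reverse completion order;
    # each sweep collects one maximal cycle (or a single node).
    assigned, blocks = set(), []
    def collect(v, block):
        assigned.add(v)
        block.append(v)
        for w in pred[v]:
            if w not in assigned:
                collect(w, block)
    for v in reversed(order):
        if v not in assigned:
            block = []
            collect(v, block)
            blocks.append(sorted(block))
    return blocks

# Hypothetical loop body: S1 -> S2 -> S3 -> S2 (a cycle), and S3 -> S4.
arcs = [("S1", "S2"), ("S2", "S3"), ("S3", "S2"), ("S3", "S4")]
blocks = pi_blocks(["S1", "S2", "S3", "S4"], arcs)
```

In this sketch S2 and S3 end up in one block because each depends on the other, while S1 and S4 are blocks by themselves.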


2.2 DAGs of π-Block Clusters


Just as there were dependences within π-blocks,
there are dependences among the π-blocks of a
program. A graph formed using π-blocks as nodes
and dependences as arcs is, however, a directed
acyclic graph (DAG), because all dependence
cycles are within π-blocks. For purposes of
compilation and execution, we may be interested
in forming nodes that are clusters of several
π-blocks; such clusters may be formed from
π-blocks accessing the same variables, or
requiring similar data alignments for processing,
etc. Thus, we will now consider clusters of
π-blocks interconnected in the form of a DAG.


The arcs in this DAG may represent depen- 
dences due to control or data flow, and there are 
three types of the latter: data dependence, anti- 
dependence and output dependence (cf., [Kuck78]). 
Associated with each of these types of arcs is a 
set of distance vectors, one for each pair of 
array variables causing a dependence. The dis- 
tance vector indicates the difference in sub- 
script values in each position. 
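
As a small illustration of a distance vector, the sketch below represents each array reference by the constant offsets added to the loop indices in its subscripts (an assumption made for brevity; general subscript expressions need more analysis). The function name `distance_vector` is hypothetical.

```python
def distance_vector(source_offsets, sink_offsets):
    """Difference in subscript values, position by position, between the
    reference that produces a value and the reference that uses it."""
    return tuple(src - snk for src, snk in zip(source_offsets, sink_offsets))

# A(I) = A(I-2) + 1: the element stored as A(I+0) is later fetched as A(I-2).
d1 = distance_vector((0,), (-2,))
# A(I,J) = A(I,J-2) + 1: no distance in I, a distance of 2 in J.
d2 = distance_vector((0, 0), (0, -2))
```

A distance of 2 means the value produced in one repetition is consumed two repetitions later, which is the information partial partition relies on.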


Most programming languages have some kind of
repetition statement (e.g., DO, FOR, etc.). We
will assume that the control for such repetitions
has been distributed down to the level of π-block
clusters (see [Kuck78], [CKTB79], or [KuMC72] for
more details). If a cluster contains π-blocks
that originated in different repetition
statements, they may be combined using mode bits,
for example.


Henceforth, we will consider DAGs of π-block
clusters. The clusters are interconnected by data
dependence arcs labeled with distance vectors, and
repetitions are associated with π-block clusters.
Thus, we can deal with a graph that consists of
antichains of π-block clusters, any of which can
be executed at once, and the program can be
executed by simply observing the dependences
between the antichains.
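
One simple way to obtain such antichains is to peel the DAG level by level: every cluster whose predecessors have all been scheduled joins the next antichain. The sketch below assumes this levelling scheme (other groupings into antichains are possible); the function name and the six-cluster graph are hypothetical.

```python
def antichains(nodes, arcs):
    """Split a DAG into successive antichains. A node is placed as soon as
    all of its predecessors are done, so nodes in one group are mutually
    independent."""
    preds = {v: set() for v in nodes}
    for src, dst in arcs:
        preds[dst].add(src)
    levels, done = [], set()
    while len(done) < len(nodes):
        ready = [v for v in nodes if v not in done and preds[v] <= done]
        if not ready:
            raise ValueError("dependence cycle: graph is not a DAG")
        levels.append(ready)
        done.update(ready)
    return levels

# Hypothetical DAG of six clusters.
arcs = [("C1", "C4"), ("C2", "C4"), ("C2", "C5"), ("C3", "C6")]
levels = antichains(["C1", "C2", "C3", "C4", "C5", "C6"], arcs)
```

Executing the groups in order, with the members of each group in parallel, observes every dependence arc.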


2.3 Statement Execution Ordering 


The atomic operations within a π-block usually
have well-known dependence relations (e.g.,
operator precedence for arithmetic expressions).
At higher levels, it is necessary to assume or
explicitly specify these dependences. For example,
most programming languages for traditional
computers assume that statements are normally
executed one after the other from the top to the
bottom of a written page or memory area. When
hardware parallelism is available, programming or
compiler techniques are needed to exploit it.


For blocks of assignment statements, various
statement execution orderings were specified in
[Kuck78]. Without repeating the formal
definitions we will sketch the ideas here, and
then we will extend these ideas to a block of
assignment statements (or π-block cluster) with
an associated repetition statement.


The two broad classes of statement execution 
ordering are SEQ and SIM. SEQ means that all the 
normal data dependence, antidependence and output 
dependence arcs are followed in executing a pro- 
gram, but any statements without such dependences 
between them may be executed in any order. If a 
number of assignment statements are to be executed 
with SIM ordering, all right-hand side atoms must 
be fetched before any left-hand side results are 
stored. These two classes have an intersection 
that contains a class specified by TOG; such 
statements may be executed together, i.e., in any 
order at all, since there are no dependences be- 
tween them. SEQ contains a class called SF (store 
all previous left-hand sides before fetching the 
next right-hand side) which corresponds to the 
strict sequential ordering implied by traditional 
languages running on traditional serial machines. 
Other execution orderings are specified in 
[Kuck78]. 
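
The difference between SF and SIM can be made concrete with a two-statement block. The sketch below is an illustration under simple assumptions: variables live in a dictionary, each statement is a (target, right-hand side function) pair, and the names `run_sf` and `run_sim` are hypothetical.

```python
def run_sf(env, block):
    """SF: store each left-hand side before fetching the next right-hand side."""
    env = dict(env)
    for target, rhs in block:
        env[target] = rhs(env)
    return env

def run_sim(env, block):
    """SIM: fetch every right-hand side atom before any result is stored."""
    values = [rhs(env) for _, rhs in block]
    out = dict(env)
    for (target, _), value in zip(block, values):
        out[target] = value
    return out

# Two statements, S1: a = b and S2: b = a.
block = [("a", lambda e: e["b"]), ("b", lambda e: e["a"])]
start = {"a": 1, "b": 2}
sf_result = run_sf(start, block)
sim_result = run_sim(start, block)
```

Under SF the second statement sees the new value of a, so both variables end up equal; under SIM the two values are exchanged.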


These ideas can be extended to repetition
statements as follows. Let I be an ordered set
(I, <i_1, ..., i_m>) called an index set, let B be
a block of assignment statements with a specified
execution ordering, and let control represent SEQ,
SIM, SF, TOG or any other execution ordering.
Then by


DO control I[B]


we mean the following.


1) Expand B according to its statement
execution ordering.


2) Copy the result of (1) for i_1, i_2, ..., i_m
from left to right.


3) Apply the execution ordering specified by
control to this set of m sequences.


Example  A traditional loop (e.g., a FORTRAN
DO loop) can be specified as


DO SF I[SF[S_1; ...; S_n]],


where I = (I, <i_1, ..., i_m>). The inner SF
requires that statements S_1 through S_n be
executed in a serial way with the left-hand side
of S_j being stored before any of the right-hand
side atoms of S_{j+1} are fetched. This sequence
is computed m times: first for i_1, then for i_2,
..., and finally for i_m.


Other examples will be given in the following
section.


2.4 Machine Considerations 


In this section we will discuss how one 
singly-nested loop can be mapped onto each of the 
several kinds of machine structures discussed 
earlier. The statement execution orderings of 
the previous section will be used to illustrate 
what a user might write or what a pre-compiler 
might generate from a traditional source language 
program. Later we will show how these statements 
can be compiled for high-speed multiprocessors. 


The four types of machines mentioned in 
Section 1 were SES, MES, SEA and MEA. The ex- 
ample of Section 2.3 showed how to specify a 
purely sequential loop for a traditional SES ma- 
chine. This may, in fact, be regarded as the 
meaning of a FORTRAN DO loop. For each of the 
other machine organizations, it is important to 
know some details of the statements in the loop. 
For an SEA machine, some parts of a loop may need
to be executed as on an SES machine, but others
can be executed as


SF[DO SIM I[S_1]; DO SIM I[S_2]; ...;
   DO SIM I[S_n]],


which corresponds to a sequence of array
operations. This kind of code is exactly the goal
of vectorizers for array machines; see [KuMC72],
[KBCD74].


An MEA machine has the additional flexibility
of being able to execute several array operations
at once. Thus, in general, the outer SF of the
SEA machine can be replaced by SEQ to allow as
many simultaneous array operations as the program
has and the machine can handle. So we have


SEQ[DO SIM I[S_1]; DO SIM I[S_2]; ...;
    DO SIM I[S_n]]


as the canonical MEA machine program.
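
The "sequence of array operations" form can be checked against the serial loop on a small case. The sketch below assumes a two-statement loop body with no dependence cycle (S1: A(I) = B(I) + C(I), S2: D(I) = A(I) + 1) and uses Python lists to stand in for arrays; the function names are hypothetical.

```python
def serial_loop(b, c):
    """DO SF I[SF[S1; S2]]: one repetition at a time, statements in order."""
    m = len(b)
    a, d = [0] * m, [0] * m
    for i in range(m):
        a[i] = b[i] + c[i]      # S1
        d[i] = a[i] + 1         # S2
    return a, d

def array_ops(b, c):
    """SF[DO SIM I[S1]; DO SIM I[S2]]: each statement becomes one whole-array
    operation, and the array operations run one after the other."""
    a = [bi + ci for bi, ci in zip(b, c)]   # S1 over the full index set
    d = [ai + 1 for ai in a]                # S2 over the full index set
    return a, d

b, c = [1, 2, 3, 4], [10, 20, 30, 40]
```

Both functions produce the same arrays here because S2 only uses the value of A produced in the same repetition; a loop with a dependence cycle would not allow this rewriting.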


Finally, consider the MES machine and several
types of programs. In the simplest case, we can
execute the same serial program independently in
each available processor, once for each loop
repetition. Thus, we can write


DO TOG I[SF[S_1; S_2; ...; S_n]]


to denote a set of m independent repetitions of n
statements. Note that if some of the S_i are
conditional expressions, separate paths may be
followed for each of the m cases. Also, this idea
can be generalized to a set of distinct and
independent blocks as might arise from a sequence
of loops.


Next, assume an MES machine with a program
that requires data to be passed between
processors. In this case we can write


DO SEQ I[SF[S_1; S_2; ...; S_n]]


to denote a set of m SF sequences, each of which
can be executed at once, subject to whatever
dependences exist between them as indicated by SEQ.
In order to have such a program execute
efficiently, a tight interprocessor coupling is
necessary. Note that in contrast with the SEA
case, we have the index on the outside and SF on
the inside here. The SEA machine executes a
sequence of array operations, whereas the MES
machine executes an array of serial computations,
for the same given source program.


3. The Compiler


We now proceed to consider a methodology to
translate programs written in a sequential
language like FORTRAN into code suitable for fast
execution in multioperation machines.


The first step is to translate the original
program into a DAG of π-block clusters.
Techniques to do this have been developed and
implemented during the last few years [CKTB79],
[Wolf78].


The transformations that should be applied to
the DAG of π-block clusters are the subject of
the remainder of this section. For reasons of
space, we had to choose in our description between
clarity and precision. We will strive to obtain
the first, relying mostly on examples (see
[Padu79] for more details).


Transformations on DAGs of π-block clusters
can be classified as follows:


1) π-block transformations;
2) π-block cluster transformations; and
3) DAG of π-block cluster transformations.


When the target machine is of the SEA type, the
π-block cluster transformations and some of the
π-block transformations will not be applicable.


We will study each one of these types of
transformations in turn. We will try to make our
considerations as machine independent as possible,
making use of the language of SEQs, SIMs, etc.,
wherever convenient. In Section 3.4, we will
consider the influence of particulars of the target
machine, giving special emphasis to the machine
described in Section 1. Throughout this section,
our goal is obtaining as fast execution time as
possible. However, even if we consider an ideal
target computer (like Murtha's IT machine
[Murt66]) and very simple programs, we find that
obtaining the optimum execution time is impossible
in practice (the underlying problems are
NP-complete). For very simple problems, like bin
packing, it has been proved that some heuristics
give results very close to the optimum. The
proofs, however, are sometimes quite elaborate
[Grah76]. In our case, the problem of finding an
optimal algorithm or analyzing a heuristic is
even harder because machines are not ideal and
programs are, in general, very complex to analyze,
involving, for example, IF statements and,
therefore, probabilistic execution times. We are
forced, then, to abandon any search for optimal
transformations and content ourselves with
engineering judgment and experimental evaluation of
our techniques. For these reasons, the statements
made here about the different transformations are
tentative; a more concrete assessment must await
experimentation.


3.1 π-Block Transformations


We will consider three types of π-block
transformations. They should be applied in the
same order they are presented here, as shown in
the overall algorithm presented in Section 3.1.4.


3.1.1 Partition  A π-block has the form


DO SF I{SF[S_1; S_2; ...; S_n]}                    (1)


where, by definition of a π-block, there is a path
in the graph of dependences from S_i to S_j for all
i, j ∈ {1,2,...,n} if cycles are present.


In cases where it is applicable, the partition
method will produce a TOG statement semantically
equivalent to (1) of the following form:


TOG{DO SF I_1 [SF{S_1; S_2; ...; S_n}];
    DO SF I_2 [SF{S_1; S_2; ...; S_n}];            (2)
    ...
    DO SF I_m [SF{S_1; S_2; ...; S_n}]}


where the I_i, i = 1, 2, ..., m are pairwise
disjoint and their union is I. We will demonstrate
its applicability by several examples.


Example 3.1  The loop


DO I = 1, M
S_1: A(I) = B(I) + C(I)
ENDO




represents a vector operation (here S_1 is a
π-block by itself). It is easy to see that the
following transformation is correct:


DO SF(I,<1,2,...,M>) [S_1]  =>
    TOG{DO SF(I,<1>) [S_1];
        DO SF(I,<2>) [S_1];                        (3)
        ...
        DO SF(I,<M>) [S_1]}.


The last TOG statement can be written more
compactly as


DO TOG(I,<1,2,...,M>) [S_1]


Transformations similar to the ones in
Example 3.1 can always be done when no cycles are
involved in the dependence graph of the π-block.
This type of transformation may be called total
partition.


Sometimes it is possible to apply partition
to π-blocks involving cycles, as shown in the next
example.


Example 3.2  Consider the loop


DO I = 3, M
S_1: A(I) = A(I-2) + 1
ENDO


We can partition this loop using the following
transformation


DO SF(I,<3,...,M>) [S_1]  =>
    TOG{DO SF(I,<3,5,...,⌊(M-1)/2⌋*2+1>) [S_1];
        DO SF(I,<4,6,...,⌊M/2⌋*2>) [S_1]}


To be able to do transformations like the one 
in Example 3.2 (called partial partition), it is 
necessary to make use of the distance vectors. 
For a similar method called splitting, see 
[BCKT79]. Partial partition may not help in the 
case of single-instruction stream machines be- 
cause the cycles could involve IF statements. 
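
The partial partition of the loop A(I) = A(I-2) + 1 can be checked numerically. The sketch below assumes 1-based indexing emulated with an unused element 0, and hypothetical initial values A(1) = 5 and A(2) = 7; it runs the odd and even chains in either order and compares the result with the serial loop.

```python
def serial(a, m):
    """DO I = 3, M: A(I) = A(I-2) + 1, executed in index order."""
    a = list(a)
    for i in range(3, m + 1):
        a[i] = a[i - 2] + 1
    return a

def partitioned(a, m, order=("odd", "even")):
    """The same loop split into two chains by index parity. The chains never
    touch each other's elements, so they may run in either order (TOG)."""
    a = list(a)
    chains = {"odd": range(3, m + 1, 2), "even": range(4, m + 1, 2)}
    for name in order:
        for i in chains[name]:      # each chain itself stays serial (SF)
            a[i] = a[i - 2] + 1
    return a

m = 9
a0 = [None, 5, 7] + [0] * (m - 2)   # index 0 unused; A(1)=5, A(2)=7
```

The last index of each chain agrees with the floor expressions of the transformation: ((m-1)//2)*2+1 for the odd chain and (m//2)*2 for the even chain.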


In the previous two examples, we considered 
only singly-nested loops. Generalization to 
loops with more levels of nesting should be obvi- 
ous. There is, however, a method that is useful 
in some cases for the partition of multiply nested 
DO loops. This method, loop interchange, is de- 
scribed in [CKTB79] and [Wolf78]. The methods of 
this section would be useful with the DOALL con- 
struct of [Burr79]. 


3.1.2 Algorithm Change  If the method of
partition does not work, this second method will
be applied. The idea is to use as much
information as possible from a π-block in order to
detect what sort of algorithm is represented by
it. Thus, if the π-block is recognized as a
linear recurrence, the best parallel algorithm
known for the particular target machine should be
applied. Algorithms for linear recurrences have
been widely studied; some results and references
can be found in [Kuck78].
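
As an illustration of what an algorithm change can look like, the sketch below evaluates the first-order linear recurrence x_i = a_i * x_{i-1} + b_i by recursive doubling. This is one well-known technique, shown here as an assumption for illustration and not necessarily the algorithm the source has in mind for any particular machine; the function names are hypothetical, and the inner loop is written serially although its iterations are independent.

```python
def serial_recurrence(a, b, x0):
    """Reference evaluation: m dependent steps."""
    xs, x = [], x0
    for ai, bi in zip(a, b):
        x = ai * x + bi
        xs.append(x)
    return xs

def doubling_recurrence(a, b, x0):
    """Recursive doubling: treat step i as the map x -> a[i]*x + b[i] and
    combine maps pairwise. Every update inside one round is independent, so
    a parallel machine needs about log2(m) rounds instead of m steps."""
    coef, const = list(a), list(b)
    m, stride = len(a), 1
    while stride < m:
        new_coef, new_const = list(coef), list(const)
        for i in range(stride, m):           # independent across i
            # compose map i (applied last) with the map ending at i - stride
            new_coef[i] = coef[i] * coef[i - stride]
            new_const[i] = coef[i] * const[i - stride] + const[i]
        coef, const = new_coef, new_const
        stride *= 2
    return [c * x0 + k for c, k in zip(coef, const)]

a = [2, 3, 1, 2, 5]
b = [1, 0, 4, 1, 2]
```

The price of the shorter critical path is extra arithmetic: each round performs O(m) operations, so the total work grows from O(m) to O(m log m).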


3.1.3 Loop Freezing  When everything else
fails, the π-block will have to be executed
serially. In this case, the body of the π-block
can be considered as a program segment and global
transformations can be applied to it.


Example 3.4  The following loop cannot be
partitioned, and no known algorithm can be applied
to speed it up.


DO I = 1, M
S_1: A(I) = A(I-1) * A(I-2) * C(I-2) + X
S_2: D(I) = D(I-1) * A(I-1) * C(I-1) + Y
S_3: C(I) = C(I-1) * A(I) * D(I) + Z
ENDO


If we freeze this loop (i.e., serialize it
and consider its body as a program segment), we
obtain the following graph of dependences


[Figure: graph of dependences]


Now, applying to the body a global
transformation (see Section 3.3), we obtain


DO SF(I,<1,2,...,M>) {SF[TOG[S_1; S_2]; S_3]}.


3.1.4 Overall Strategy  We conclude our
discussion on π-block transformations with a
description of the overall strategy. This is
shown in the following algorithm:


Algorithm 3.1


Input: π-block P of the form
       DO SF I{SF[S_1; S_2; ...; S_n]}
Output: Execution structure


If partition can be applied to P
Then
    Transform DO SF I{SF[S_1; S_2; ...; S_n]}
    to TOG{DO SF I_1 {SF[S_1; S_2; ...; S_n]};
           ...
           DO SF I_m {SF[S_1; S_2; ...; S_n]}}
    For k = 1 to m apply Algorithm 3.1
        to DO SF I_k {SF[S_1; ...; S_n]}
Else
    If algorithm change can be applied
    Then apply algorithm change
    Else apply global transformation to the
         body of P (i.e., to SF[S_1; S_2; ...;
         S_n]).


To see how the algorithm will work, we will
use the following example.


Example 3.5  Let us apply Algorithm 3.1 to the
loop


DO I = 1, MI
   DO J = 3, MJ
S_1:  A(I,J) = A(I,J-2) + 1
   ENDO
ENDO


Originally, S_1 is a π-block by itself with
the following representation


DO SF(I,<1,2,...,MI>) × (J,<3,4,...,MJ>) [S_1]     (5)


When we apply Algorithm 3.1 to (5), we will
have the following sequence of transformations


Step 1


DO SF(I,<1,2,...,MI>) × (J,<3,4,...,MJ>) [S_1]
    =>
TOG{DO SF(I,<1>) × (J,<3,4,...,MJ>) [S_1];
    DO SF(I,<2>) × (J,<3,4,...,MJ>) [S_1];
    ...
    DO SF(I,<MI>) × (J,<3,4,...,MJ>) [S_1]}


Step 2


Applying the algorithm to DO SF(I,<k>) ×
(J,<3,4,...,MJ>) [S_1] for k ∈ {1,2,...,MI}, we
obtain


TOG{DO SF(I,<k>) × (J,<3,5,...,⌊(MJ-1)/2⌋*2+1>) [S_1];   (6)
    DO SF(I,<k>) × (J,<4,6,...,⌊MJ/2⌋*2>) [S_1]}


as in Example 3.2.


Step 3


Finally, we can apply a linear recurrence
algorithm to the DO SFs in (6).


3.2 π-Block Cluster Transformations


In this section we will start by considering
a cluster of π-blocks with the following
characteristics:


1) All π-blocks come from the same loop in the
original source program;


2) The DAG of dependences of the π-blocks in
the cluster has the form of a chain; and


3) The execution time of every iteration of
any given π-block is constant.


We will study how to transform this cluster,
which we will call a simple cluster, to produce
good machine code. Later, we will mention
briefly how to extend the techniques studied to
more general clusters. We will consider the case
of an MES machine only. The case of an MEA machine
can be treated similarly.


Let us say that the statements in the cluster
are S_1, S_2, ..., S_n. One way of executing this
cluster is the one shown at the end of Section 2.4,
namely,


DO SEQ I[SF[S_1; S_2; ...; S_n]].


Let us consider the consequences of executing a
recurrence computation using this method. Suppose
that statement S_ℓ is of the form


X_i = X_{i-1} * A_i + B_i.


Assuming that statements S_1 through S_{ℓ-1} had
no recurrence, all processors in an MES machine
could reach statement S_ℓ within a short time
interval of each other. However, S_ℓ would be
executed first on processor 1, then on processor
2, then on processor 3, and so on, with a gap of
length O(m) in processor m as it waits for
X_{m-1}. Thus the execution time for a program
with a single recurrence, assuming m loop
repetitions and n statements, would be O(m+n)
steps. Since the serial computation takes O(mn)
steps, the efficiency (speedup/processors) is
O(n/(m+n)).


Now let us study an alternative scheme for
executing such a computation, where the machine
will execute the first π-block on processor 1, the
second π-block on processor 2, and so on. This
computation can be represented by


SEQ[DO SF I[S_1; ...];
    ...                                            (7)
    DO SF I[...; S_n]]


where the statements of the first DO SF (beginning
with S_1) form the first π-block and those of the
last DO SF (ending with S_n) form the last π-block.
Each π-block is executed serially by DO SF I;
however, the ensemble of cluster repetitions is
executed with SEQ control, so they proceed as much
in parallel as possible. We call this method
pipelining because a loop is broken into a number
of smaller loops that may be chained together.
Note that in this case, assuming that clusters are
small, O(n) processors are used, and assuming that
dependences exist across the entire set of
processors, the computation time is O(m+n) steps
(as it was above), but the efficiency is now
O(m/(m+n)). Thus, pipelining tends to be more
efficient when the number of loop repetitions is
large relative to the number of loop statements.
Of course, if we need not pay time for dependences
across the entire array, then the pipeline time is
O(m), so the speedup and efficiency increase.
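
The two efficiency estimates can be compared numerically. The sketch below takes the order-of-magnitude expressions literally, with all hidden constants set to one (an assumption made only to obtain numbers); the function names are hypothetical.

```python
def iteration_per_processor(m, n):
    """One loop repetition per processor (m processors), with one recurrence
    that serializes the repetitions: about m + n steps."""
    time = m + n
    speedup = (m * n) / time
    return time, speedup / m          # efficiency = n / (m + n)

def pipelined(m, n):
    """One statement group per processor (n processors), repetitions
    streaming through the chain: again about m + n steps."""
    time = m + n
    speedup = (m * n) / time
    return time, speedup / n          # efficiency = m / (m + n)

time_a, eff_a = iteration_per_processor(1000, 10)
time_b, eff_b = pipelined(1000, 10)
```

With 1000 repetitions of 10 statements both schemes need the same number of steps, but the first keeps 1000 processors about 1% busy while the pipeline keeps 10 processors about 99% busy.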


Henceforth, we will assume that the number of 
loop repetitions is large relative to the number 
of statements. We will, therefore, only consider 
pipelining. The application of what follows when 
the number of statements is large is immediate. 


When translating to an instruction-controlled
machine (as opposed to a data-controlled or
dataflow machine) we will, for simplicity, use a
synchronization-per-iteration approach. The
resulting structure for the case of a singly-nested
loop is shown next.


parbegin

  π_1:  DO I = 1, M
           S_1; S_2; ...; V(σ_1)
        ENDO;
        ...
  π_ℓ:  DO I = 1, M
           P(σ_{ℓ-1}); ...; V(σ_ℓ)                 (8)
        ENDO;
        ...
  π_m:  DO I = 1, M
           P(σ_{m-1}); ...; S_n
        ENDO

parend


Here, as before, the statements executed by π_1
(beginning with S_1) form the first π-block, those
executed by π_ℓ form the ℓth π-block, and those
executed by π_m (ending with S_n) form the last
π-block of DO SF I[S_1; ...; S_n].


The statements P and V are the well-known
synchronization primitives, and the σ_j, j = 1,
..., m-1, are semaphores. From the definition of
a π-block, it is easy to see that (8) is a correct
transformation.
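
The structure of (8) can be imitated with threads and counting semaphores. The sketch below is an illustration only: it assumes a chain of three single-statement blocks (the statements of the loop used in Example 3.7), maps each block to a Python thread, and uses `threading.Semaphore`, whose `acquire` and `release` play the roles of P and V.

```python
import threading

def pipelined_loop(b, c):
    """Three blocks of a loop body run as three concurrent processes,
    synchronized once per iteration with semaphores."""
    m = len(b)
    a, d, e = [0] * m, [0] * m, [0] * m
    sigma1 = threading.Semaphore(0)
    sigma2 = threading.Semaphore(0)

    def stage1():
        for i in range(m):
            a[i] = b[i] + c[i]          # S1
            sigma1.release()            # V(sigma1)

    def stage2():
        for i in range(m):
            sigma1.acquire()            # P(sigma1)
            d[i] = a[i] + 1             # S2
            sigma2.release()            # V(sigma2)

    def stage3():
        for i in range(m):
            sigma2.acquire()            # P(sigma2)
            e[i] = d[i] + 2             # S3

    threads = [threading.Thread(target=f) for f in (stage1, stage2, stage3)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return a, d, e

a, d, e = pipelined_loop(list(range(10)), [1] * 10)
```

Because each semaphore counts completed iterations of the preceding stage, stage k can never start iteration i before stage k-1 has finished it, which is the dependence the chain requires.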


In the segment of program shown 
in (8) we see that for large enough M the execu- 
tion time is dominated by the bottlenecks, which 
are defined as those T-blocks of maximum execution 
time. 


Given that pipelining can be applied to a 
cluster of tT-blocks, we are faced with the problem 
that local transformations can also be applied to 
the individual T-blocks. We now proceed to state 
an algorithm that integrates the local transforma- 
tions of 3.1 with pipelining. 


Algorithm 3.2


Input: A simple cluster of π-blocks


Output: The cluster transformed for parallel
execution; and its execution time.


Step 1


Compute the execution time of the cluster
when executed as a pipeline. To compute this time,
we should attempt to decrease the size of the
bottlenecks by applying loop freezing and global
transformations to their bodies. Call the
pipeline execution time T_pip, and the resulting
program structure PS_pip.


Step 2


Let us say that the π-blocks in the cluster
are π_1, π_2, ..., π_m and that π_ℓ is the first
bottleneck. Then we proceed as follows:


1) Apply local transformations (Algorithm 3.1)
to π_ℓ. Let T_ℓ be the execution time of the
resulting program structure and PS_ℓ the
structure itself.

2) Apply Algorithm 3.2 to the chain π_1, π_2,
..., π_{ℓ-1}. Let T_1 be the execution time of
the resulting structure and PS_1 the structure
itself.

3) Same as (2) but for chain π_{ℓ+1}, ..., π_m.
Let T_2 be the execution time and PS_2 the
resulting structure.


Finally, let T_loc = T_1 + T_ℓ + T_2 and
PS_loc = SF[PS_1; PS_ℓ; PS_2].


If T_pip ≤ T_loc, then the result is T_pip and
PS_pip, else the result is T_loc and PS_loc.


Example 3.7  Consider the following loop:


DO I = 1, 10
S_1: A(I) = B(I) + C(I)
S_2: D(I) = A(I) + 1                               (10)
S_3: E(I) = D(I) + 2
ENDO


Here each statement is a π-block by itself.
If we only count arithmetic operations when
computing the execution time, then we have
T_pip = 2 + 10 = 12.


When Step 2 of Algorithm 3.2 is executed, S_1
(the first bottleneck) will be partitioned as a
vector operation (total partition). The execution
time when P processors are available is ⌈10/P⌉.


Finally, if it is assumed that P > 3, then the
time of the alternative produced by Step 2 is
T_loc = 3⌈10/P⌉ < T_pip. Therefore, (10) should
be executed as a sequence of vector operations.


Example 3.8  In the loop


DO I = 1, 100
S_1: A(I) = B(I) + 1                               (11)
S_2: C(I) = C(I) * C(I-1) + A(I) * C(I-2)
S_3: D(I) = C(I) + 1
ENDO


each statement is a π-block. The pipeline
execution time (assuming, as in Example 3.7, that
only arithmetic operations count) is
T_pip = 2 + 4 × 100 = 402.


Applying Algorithm 3.2 to (11), we find that
the bottleneck cannot be improved. Therefore, the
time of the alternative produced by Step 2 is
T_loc = 2 × ⌈100/P⌉ + 4 × 100, and T_pip ≤ T_loc
for P > 3. We conclude that pipelining should be
used for (11).
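
The arithmetic of Examples 3.7 and 3.8 can be reproduced with two small cost functions. The formulas below are inferred from the two worked examples (fill time of the non-bottleneck stages plus one bottleneck time per iteration for the pipeline; one sweep of ⌈iterations/P⌉ for each totally partitioned block otherwise) and are an assumption, as are the function names and the per-block time of 4 units taken from the text of Example 3.8.

```python
from math import ceil

def pipeline_time(stage_times, iterations):
    """Fill time of the non-bottleneck stages plus the bottleneck
    stage repeated once per iteration."""
    bottleneck = max(stage_times)
    return sum(stage_times) - bottleneck + bottleneck * iterations

def sequence_time(stage_times, partitionable, iterations, processors):
    """Blocks executed one after another: a totally partitioned block takes
    ceil(iterations / processors) sweeps, a serial block one per iteration."""
    total = 0
    for t, can_split in zip(stage_times, partitionable):
        sweeps = ceil(iterations / processors) if can_split else iterations
        total += t * sweeps
    return total

# Example 3.7: three unit-time blocks, all partitionable, 10 iterations, P = 4.
t_pip_37 = pipeline_time([1, 1, 1], 10)
t_loc_37 = sequence_time([1, 1, 1], [True, True, True], 10, 4)
# Example 3.8: the middle block takes 4 units and stays serial, 100 iterations.
t_pip_38 = pipeline_time([1, 4, 1], 100)
t_loc_38 = sequence_time([1, 4, 1], [True, False, True], 100, 4)
```

With P = 4 the vector alternative wins for loop (10) (9 against 12 time units), while the pipeline wins for loop (11) (402 against 450).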




In general, we will have to deal with cases more
general than a chain of π-blocks. Some of the
considerations required to extend our previous
algorithm are given next.

1) In general, the execution times of the
π-blocks will not be constant. For example, if
IF statements are involved, the execution time
could be random.


In this case, we could be very precise when
computing execution times if all probabilities are
known. This computation can, however, become
lengthy and, furthermore, the probabilities are
not always known. We should try, then, to use a
simple value like the maximum possible execution
time when computing the overall execution time of
the cluster.

2) A cluster of π-blocks could have a DAG of
dependences more general than a simple chain.
If the number of iterations of the original
DO loop is much bigger than the number of
statements, we could apply topological
sorting [Knut73] to obtain a chain, and
then apply Algorithm 3.2. However, if the
number of iterations is comparable to the
number of π-blocks, a different algorithm
should be applied.

3) The π-blocks in a cluster do not have to
originate in the same loop of the source
program. By using techniques like fusion
[AbKL79], π-blocks coming from different
loops could become part of the same cluster.


3.3 DAG of π-Block Clusters Transformations


Once the π-block clusters have been
transformed, it remains to transform the whole DAG
of π-block clusters.


If the target machine is of the SEA type,
topological sorting should be done on the DAG,
and then we can execute the clusters sequentially.
On the other hand, for MES or MEA computers as
many clusters as possible should be executed
simultaneously.


Example 3.9  Consider the following DAG of
clusters of π-blocks


[Figure: DAG of clusters C_1, ..., C_6]


The antichains are C_1, C_2 and C_3; and C_4, C_5
and C_6. Execution of this DAG in a MEA or MES
computer, given that enough processors are
available, will be as follows:


SF[TOG{C_1; C_2; C_3}; TOG{C_4; C_5; C_6}]


In a SEA computer, the DAG could be executed as


SF[C_1; C_2; C_3; C_4; C_5; C_6]


3.4 Transformations Dependent on Machine Particulars


We now briefly consider some transformations 
that could be helpful in the particular machine 
we showed in Section 1. These transformations are 
intended as illustrative and are by no means ex- 
haustive. 


3.4.1 Data Transmission and Synchronization 


One of the goals in code generation should 
be to try to make as little use as possible of 
the global alignment network. Also, if the pro- 
cessors have general registers, we should try to 
use them instead of the local alignment network. 
Transformations to achieve this goal are not 
always straightforward. 


Example 3.10  Consider the following program


DO I = 1, N
S_1: A(I) = B(I) + C(I)
ENDO                                               (12)
DO I = 1, N
S_2: D(I) = A(I-8) + 2
ENDO


If (12) is executed in such a way that the ith
iteration of S_1 is executed in the same processor
as the ith iteration of S_2, it will be necessary
to use the local alignment network (or even the
global alignment network if the clusters are very
small) for data transmission (and synchronization
if the global control unit is not used). A better
solution is to execute the ith iteration of S_1 in
the same processor as the (i+8)th iteration of
S_2. Then data transmission and synchronization
are avoided.
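
The benefit of the shifted allocation can be counted directly. The sketch below assumes a simple cyclic assignment of iterations to P processors (iteration i of S_1 on processor i mod P), which is an illustrative assumption and not a description of the machine in the source; the function name `transfers` is hypothetical.

```python
def transfers(n, processors, shift):
    """Count values of A that must cross between processors when iteration i
    of S1 runs on processor i mod P and iteration j of S2 on (j - shift) mod P."""
    count = 0
    for j in range(9, n + 1):                 # S2 reads A(j-8), produced for j >= 9
        producer = (j - 8) % processors       # where S1 computed A(j-8)
        consumer = (j - shift) % processors   # where S2 computes D(j)
        if producer != consumer:
            count += 1
    return count

naive = transfers(100, 5, 0)      # ith iteration of S2 with ith iteration of S1
aligned = transfers(100, 5, 8)    # (i+8)th iteration of S2 with ith of S1
```

Under this cyclic assignment the naive allocation only avoids transfers when the number of processors happens to divide the distance 8, whereas the shifted allocation avoids them for any number of processors.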


Example 3.11  Suppose we want to pipeline the
following DAG of π-blocks.


[Figure: DAG of π-blocks]


If we have the machine of Fig. 1 with c = 2, there
will be some allocations that will need use of the
global alignment networks and others that won't,
as shown in the next figure.


[Figure: assignment of clusters to processors,
showing a good allocation and a bad allocation]


3.4.2 Reclustering to Increase Efficiency of Pipelining


Let us start with one example. 


Example 3.11  If in the chain of π-blocks


[Figure: chain of π-blocks S_1, S_2, S_3]          (12)


one iteration of S_3 takes two units of time and
S_1 and S_2 take one unit of time per iteration
each, then if we execute (12) (using a canonical
transformation like (8)) as


SEQ{DO SF I[S_1]; DO SF I[S_2]; DO SF I[S_3]},     (13)


we will obtain an efficiency close to 2/3.
However, if we cluster S_1 and S_2 and execute as


SEQ{DO SF I[SF{S_1; S_2}]; DO SF I[S_3]}           (14)


we will obtain an efficiency close to 1. Notice
that (13) and (14) will take the same amount of
time.
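
The efficiencies quoted for (13) and (14) follow from a simple count. The sketch below assumes one processor per pipeline stage, a pipeline paced by its slowest stage, and 1000 loop repetitions (an arbitrary choice to approximate the limit); the function name is hypothetical.

```python
def pipeline_efficiency(stage_times, iterations):
    """Efficiency = serial time / (processors * pipelined time), with one
    processor per stage and the pipeline paced by its slowest stage."""
    serial = sum(stage_times) * iterations
    bottleneck = max(stage_times)
    pipelined = sum(stage_times) - bottleneck + bottleneck * iterations
    return serial / (len(stage_times) * pipelined), pipelined

eff13, time13 = pipeline_efficiency([1, 1, 2], 1000)   # S1 | S2 | S3
eff14, time14 = pipeline_efficiency([2, 2], 1000)      # SF{S1; S2} | S3
```

Both arrangements finish in the same time because the two-unit statement paces the pipeline in either case; reclustering simply removes a processor that was idle half of the time.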


The goal of the process of reclustering is to
increase efficiency without increasing time. The
problem of finding the best possible clustering
can easily be shown to be NP-complete; therefore,
some heuristics should be used.


3.4.3 Allocation Overhead


If dynamic processor allocation is allowed, the
time to allocate a processor could have a
determining effect on the degree of parallelism
that can be used profitably. We have looked at a
very simplified version of this problem and it
turns out that even if allocation times are
constant and an unlimited number of processors is
available, the problem is NP-complete [Padu79].


4. Conclusion


Throughout this paper we have discussed
methods of compilation for a high-speed
multiprocessor and we have referenced similar
papers for array processors. Table I summarizes
some general conclusions about the performance of
two types of machines. Two points that are
frequent sources of difficulty in array machines
are conditional branching inside loops and the
accessing of arrays (or other data structures) in
irregular ways. Both of these subjects will be
treated here.

IF Trees 


Conditional statements inside loops can fre- 
quently be turned into array tests for fast exe- 
cution [Bane79]. In fact, nearly all of the pro- 
grams analyzed in [KBCD74] achieved substantial 
SEA speedup despite IF statements inside DO loops. 
Nevertheless, in some cases, IF statements inside 
loops can present serious difficulties to fast 
SEA execution. 


For example, a loop with many equally likely
paths and relatively few iterations could lead to
many very short vector operations. In such cases,
it may be desirable to partition the loop for
execution in a high-speed multiprocessor. Thus,
each processor can follow a separate path through
the loop as in each iteration of a serial machine
execution. Synchronization may be needed, but it
can also be traded for some redundant computation.


Additional speedup may be achieved by equip- 
ping each processor with a parallel IF-tree 
evaluator [Kuhn79]. After the conditions of the 
tests are computed, the path through the decision 
tree is evaluated in time (gate delays) propor- 
tional to the log of the number of IFs in the 
loop (we are assuming there are many IFs). After 
the proper outcome is selected, final computa- 
tions are made for each iteration in its proces- 
sor. This technique may be viewed as an array of 
sequential computations, and this is indicated in 
Table I. 


Interconnections 


It is well known that the fastest possible
processor-memory switch is the crossbar. However,
the cost of the crossbar switch for a large
number of processors and memories is quite high.
Alternatives to the crossbar switch have been
developed using the concept of perfect shuffle
[Ston71].


Table I


                     Array Processor            Multiprocessor         Pipelining Across
                     Vectorization              (array of sequences)   Multiprocessors
                     (sequence of arrays)

1) Number of         Proportional to the        Proportional to        Proportional to the
   processors for    number of iterations       number of iterations   number of statements.
   maximum speedup   of the loop.               of the loop.

2) Speedup           High if π-blocks are       High                   Proportional to
                     vector operations or if                           number of processors.
                     they are linear
                     recurrences. Equal or
                     worse than pipelining
                     if this is not the case.

3) Efficiency        High for vector            High                   Over 70% in
                     operations and linear
                     recurrences with small
                     band.
                     other cases.


For array machines, we could use the omega 
network [Lawr75] which allows fast access to rows, 
columns and diagonals of two-dimensional matrices 
when these are suitably distributed among memory 
modules. This same omega network could be used in 
multiprocessors as proposed in [Burr79]. 


Another possibility is the one-stage perfect
shuffle with queueing on each comparator [Lang76].
The queueing works well for array machines; how- 
ever, when requests to memory are random, as could 
be the case in a multiprocessor, the queues could 
become too long. An alternative to queueing we 
have been studying is to set at random the two in- 
put modules when conflict arises. One of the re- 
sults we have obtained is that, if two one-stage 
perfect shuffles are present in a system with n 
processors and n memories, the average delay for a 
request between processor and memory will take 


o(7n) Stages [Padu79]. This magnitude can be 
greatly decreased using other techniques. 
Abstract -- The unifying effects of progress 
in language and computing theory, coupled with 
the advent of inexpensive microprocessors, have 
led many computer architects to consider the 
modular design of computer systems incorporating 
multiple microprocessors to implement various 
functions of the overall system. This paper is 
concerned with the parallel processing of 
high-level language programs by a multiple 
microprocessor system. The notion of a Parallel 
Execution String (PES) is first introduced as a 
representation of expressions for parallel 
execution. The PES approach is then applied to 
detect the parallelism at both the statement and 
the block levels. The advantage of the PES 
approach over the conventional "parallel by level" 
approach is discussed, and two algorithms are 
given to convert expressions into PES's. Finally, 
the organization of a multiple microprocessor 
system designed for parallel processing of PES's 
is presented. Code generation and optimization 
techniques are also discussed. 


I. Introduction 


The advent of inexpensive microprocessors 
has led many computer architects to consider the 
design of computer systems incorporating multiple 
microprocessors to implement various functions of 
the overall system [1]. Examples are the 
implementation of a general-purpose pipelined 
CPU [2], an emulator for the Sperry Univac 1108 
mainframe [3], the Cm* [4], the Multi Associative 
Processor (MAP) [5], the Distributed Function 
Multiple Processor (DFMP) [6], the 
direct-execution computer organization [7], etc. 
This paper is concerned with the parallel 
processing of high-level language programs. 


In this paper, we investigate the problems 
involved in parallel execution of arithmetic 
expressions in high-level programming languages. 
We are not concerned with the tree-height 
reduction techniques as proposed by Baer and 
Bovet [8], by Ramamoorthy and Gonzalez [9], by 
Squire [10], and by Stone [11]. Rather, we are 
dealing with a representation of parallelism, a 
way the expressions are executed, and an 
appropriate computer organization for carrying 
out the execution. The notion of a Parallel 
Execution String (PES) is introduced as a 
representation of expressions for parallel 
execution. The PES approach is then applied to 
detect the parallelism at both the statement and 
the block levels in Section II and Section III, 
respectively. Two algorithms are then given to 
convert expressions into PES's and to schedule 
them for execution in Section IV. A machine 
organization suitable for carrying out parallel 
operations with our approach is described in 
Section V. Finally, code generation and code 
optimization techniques are discussed in Section 
VI and Section VII, respectively. 

* Research reported herein was supported in 
part by grants AFOSR-77-3400 and 
NSF-MCS-77-23496. 


II. Parallel Processing of Expressions 


It is well known that an expression can be 
represented by a rooted tree, with its internal 
nodes denoting operators and its external nodes 
variables and constants [12]. The son-nodes of a 
node are the operands of that node. For two 
operator nodes in the tree, if neither is the 
ancestor of the other, these two operators are 
independent and thus can be executed in parallel. 
In order to take advantage of the parallelism 
during the execution of an expression, there 
should be an intermediate form, into which the 
expression can be transformed, that shows the 
parallelism explicitly. One method (i.e. the 
conventional parallel-by-level approach [13]) is 
to group together the operations that are at the 
same level in the tree, and then to execute the 
operations in a group in parallel. The result of 
each operation is represented by some external 
symbol, which is not in the expression, and the 
symbol is then used as the operand of some 
operation on a higher level. This method 
implicitly assumes that all operations take an 
equal length of time, for otherwise there will be 
instances in which the executions of some 
operations are delayed unnecessarily. However, 
this assumption does not hold for most real 
computers. Thus we propose another scheme to 
represent an expression in a form that is 
appropriate for parallel execution regardless of 
the execution times of the various operations. 
Later we will see that this scheme also has some 
additional advantages. First, let us give some 
definitions: 


Definition 

In an expression tree, an operator node is 
called 
   type 1 -- if all of its operands are 
             variables or constants; 
   type 2 -- if exactly one of its operands is 
             an operator; and 
   type 3 -- if more than one of its operands 
             are operators. 
If we consider only unary and binary 
operators, then the definition of type 3 
becomes: 
   type 3 -- if it has two operands being 
             operators. 

For simplicity, from now on we will only 
consider unary and binary operators. However, 
the proposed scheme can be easily extended to 
handle operators with more than two operands. 
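
The three node types can be illustrated with a small sketch. The tree encoding used here is an assumption made for illustration and does not come from the paper: a leaf is a string, and an operator node is a tuple whose first element is the operator and whose remaining elements are its operands. The name `node_type` and the unary operator name `neg` are likewise hypothetical.

```python
# Assumed encoding (not from the paper): a leaf is a string such as "A";
# an operator node is a tuple (op, operand) or (op, left, right).

def node_type(node):
    """Classify an operator node as type 1, 2 or 3.

    The type depends only on how many operands are themselves operators:
    none gives type 1, exactly one gives type 2, more than one gives type 3.
    """
    if isinstance(node, str):
        raise ValueError("variables and constants have no node type")
    operator_operands = sum(isinstance(child, tuple) for child in node[1:])
    return min(operator_operands, 2) + 1


if __name__ == "__main__":
    product = ("*", "B", "C")               # B*C: both operands are leaves
    total = ("+", product, "A")             # B*C + A: one operand is an operator
    merged = ("+", product, ("neg", "D"))   # B*C + (-D): both operands are operators
    for tree in (product, total, merged):
        print(tree, "-> type", node_type(tree))
```

Because the count is capped rather than compared with two, the same function also covers operators with more than two operands, in line with the general form of the definition.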


If we consider the type 1 nodes as starting 
points toward the root of the tree, then there 
are as many paths as type 1 nodes. Each path 
passes through a sequence of operators and 
uniquely defines a string of operators and 
operands, starting at a type 1 node and ending at 
the root. The string is to be called a 
Parallel Execution String (PES). These paths 
merge together on their ways toward the root and 
eventually converge at the root, where the last 
operation is performed. Note that each path has 
a type 1 node at one end and the root of the tree 
at the other end, and all of the intermediate 
nodes on each path are of type 2 or type 3. Type 
3 nodes are merging points of paths, whereas the 
others are of type 2. 

We observe that for each path, all the 
operations on that path have to be executed 
sequentially, beginning from the starting node 
and heading toward the root. However, any two 
nodes on two different paths prior to the merging 
point of these two paths (even though they are at 
different levels) are independent and thus can 
be executed in parallel. From these observations 
we can see that there exists more parallelism 
among the PES's of an expression than what can be 
exploited by the parallel-by-level approach. In 
a formal way, we can define a PES as follows: 
Definition 

A Parallel Execution String (PES) of an 
expression is a sequence 

     D_1 T_1 D_2 T_2 ... D_{n-1} T_{n-1} D_n 

where 

(1) T_1, T_2, ..., T_{n-1} are operator nodes in 
    the tree, T_1 is a type 1 node, T_2, ..., 
    T_{n-1} are of type 2 or 3, and T_{n-1} is 
    the root node; 

(2) for i = 1, 2, ..., n-2, T_{i+1} is the father 
    node of T_i; 

(3) D_1 and D_2 are operands of T_1; 

(4) for every i such that T_i is of type 2, 
    D_{i+1} is either empty or an operand of 
    T_i, depending upon whether T_i is a 
    unary or binary operator; 

(5) for every i such that T_i is of type 3, 
    D_{i+1} is represented by "#k", where k 
    is a number uniquely identifying the 
    node T_i among all the type 3 nodes; 
    and 

(6) if T_i corresponds to a non-commutative 
    operator X, and T_{i-1} is the right 
    son-node of T_i, then T_i is represented 
    by X'. 


Figure 1 is an example of representing an 
expression in PES notation. (In Section IV, we 
will present an algorithm for converting an 
arithmetic expression into PES's.) 


The expression -(A+G+B*C)/(D*(E+I)+F)+H 
can be represented as a tree: 

     [tree diagram] 

It can be compiled into PES's as: 

     A + G + #1 - / #2 + H 
     B * C + #1 - / #2 + H 
     E + I * D + F /' #2 + H 

Fig. 1 Example of translating an expression 
into PES's 
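
As an illustration of the definition, the following sketch derives the PES's of Fig. 1 directly from the expression tree by following each path from a type 1 node to the root. It is not the paper's Algorithm A (which works on a reverse Polish string and appears in Section IV); the tuple encoding of the tree, the function name `pes_strings`, and the use of `neg` for unary minus are assumptions made here so that unary and binary minus stay distinguishable.

```python
# Assumed encoding (not from the paper): a leaf is a string; an operator node
# is (op, operand) for a unary operator or (op, left, right) for a binary one.
COMMUTATIVE = {"+", "*"}


def pes_strings(tree):
    """Return the PES's of an expression tree as lists of tokens."""
    type3_count = [0]  # numbers the type 3 nodes in post-order

    def walk(node):
        op, *kids = node
        operator_kids = [k for k in kids if isinstance(k, tuple)]
        if not operator_kids:                      # type 1: start a new string
            return [[kids[0], op] + kids[1:]]
        if len(kids) == 1:                         # unary type 2
            return [s + [op] for s in walk(kids[0])]
        left, right = kids
        reverse = op if op in COMMUTATIVE else op + "'"
        if len(operator_kids) == 1:                # binary type 2
            if isinstance(left, tuple):
                return [s + [op, right] for s in walk(left)]
            return [s + [reverse, left] for s in walk(right)]
        # type 3: paths from both sons merge here and share one "#k"
        left_strings, right_strings = walk(left), walk(right)
        type3_count[0] += 1
        tag = "#%d" % type3_count[0]
        return ([s + [op, tag] for s in left_strings]
                + [s + [reverse, tag] for s in right_strings])

    return walk(tree)


# -(A+G+B*C)/(D*(E+I)+F)+H, the expression of Fig. 1
FIG1_TREE = ("+",
             ("/",
              ("neg", ("+", ("+", "A", "G"), ("*", "B", "C"))),
              ("+", ("*", "D", ("+", "E", "I")), "F")),
             "H")

if __name__ == "__main__":
    for string in pes_strings(FIG1_TREE):
        print(" ".join(string))
```

With `neg` read as the unary minus of the figure, the three strings produced correspond to the three PES's listed in Fig. 1.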


With the definitions above, we thus propose 
a scheme to decompose an expression into PES's 
and execute them in parallel as follows. 
Whenever a processor is free, it will pick up one 
of the PES's that have not yet been taken, and 
start executing that PES from left to right. The 
first operation is always of type 1, which means 
all of its operands are variables or constants, 
so that it can be executed immediately. If it is 
a unary operator, the processor performs the 
operation and keeps the result in the processor 
for the next instruction. If it is a binary 
operator, the processor loads the first operand, 
performs the operation on the first and second 
operands, and keeps the result for the next 
instruction. 


When the processor reaches a type 2 node, 
nothing will prevent it from performing that 
operation, because this node has exactly one of 
its operands as an operator, which has already 
been executed immediately prior to this node by 
the same processor. Note that the result of the 
previous operation is still kept in the processor 
and can be readily used as the operand for this 
type 2 node. The processor will execute the type 
1 and type 2 nodes in the PES one by one, 
independent of other PES's. 


When the processor reaches a type 3 
operator, i.e. the operator with a #k operand, it 
will either continue executing the PES or save 
the partial result obtained thus far and then 
give up the PES, depending upon whether or not 
any of the other PES's passing through the same 
type 3 node has been executed up to this node. 
One possible machine organization to implement 
this scheme is described in Section V. Each PES 
will be executed only by one processor, although 
the processor may give up that PES before it 
reaches the end. 
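
The execution rule just described can be sketched as a small sequential simulation. This is only an illustration under stated assumptions, not the machine of Section V: a single loop plays the role of the processors, taking the PES's one after another; a dictionary stands in for the storage of partial results; the names `run_pes`, `evaluate`, `PES_FIG1` and `ENV`, the operator `neg` for unary minus, and the numeric values are all hypothetical.

```python
import operator

BINARY = {"+": operator.add, "-": operator.sub,
          "*": operator.mul, "/": operator.truediv}
UNARY = {"neg": operator.neg}


def run_pes(tokens, env, partial_results):
    """Execute one PES from left to right.

    Returns the final value, or None if the PES is given up at a type 3
    node because the other path has not yet delivered its partial result.
    """
    acc = env[tokens[0]]
    pos = 1
    while pos < len(tokens):
        op = tokens[pos]
        if op in UNARY:
            acc = UNARY[op](acc)
            pos += 1
            continue
        operand = tokens[pos + 1]
        pos += 2
        if operand.startswith("#"):                # type 3 node
            if operand not in partial_results:
                partial_results[operand] = acc     # first arrival: store, give up
                return None
            other = partial_results.pop(operand)   # second arrival: operate
        else:
            other = env[operand]
        fn = BINARY[op.rstrip("'")]
        # A primed operator means the accumulator is the right-hand operand.
        acc = fn(other, acc) if op.endswith("'") else fn(acc, other)
    return acc


def evaluate(pes_list, env):
    """Run the PES's in the given order; exactly one of them reaches the root."""
    partial_results = {}
    results = [run_pes(tokens, env, partial_results) for tokens in pes_list]
    return [r for r in results if r is not None][0]


PES_FIG1 = [
    "A + G + #1 neg / #2 + H".split(),
    "B * C + #1 neg / #2 + H".split(),
    "E + I * D + F /' #2 + H".split(),
]
ENV = dict(A=1.0, G=2.0, B=3.0, C=4.0, D=5.0, E=6.0, I=7.0, F=8.0, H=9.0)

if __name__ == "__main__":
    print(evaluate(PES_FIG1, ENV))
```

Whatever order the strings are taken in, the value reaching the root is the same, since each merge is performed by whichever path arrives second.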


As mentioned earlier, the PES scheme has the 
advantage that it can exploit all the parallelism 
within an expression even when the operations 
take unequal lengths of time. In addition, we 
also find that it has the following advantages: 

1. The execution of a PES is done in a 
   straightforward way from left to right. No 
   precedence relation between the operators 
   needs to be considered, and it does not need 
   any stack. 

2. If the operations in the expressions are 
   limited to unary and binary operations, it 
   only needs a one-address instruction to 
   perform the operation for each of the type 2 
   and type 3 nodes, and two one-address 
   instructions for each of the type 1 nodes. 

3. The partial results need not be stored for 
   any type 1 and type 2 nodes; they remain 
   in the processors and will be used in 
   subsequent operations. Even though the 
   partial result may have to be stored for a 
   type 3 node, this occurs only when the other 
   path has not yet reached that node, so that 
   storing the partial result in this case will 
   not really increase the total execution 
   time. 

4. When a PES is assigned to a processor, a 
   sequence of operations will be performed by 
   the processor without the intervention of 
   others, so that the execution can be done as 
   fast as possible. This also makes it 
   possible to employ some techniques (e.g., 
   pipelining instruction fetching, decoding, 
   and execution) to increase the execution 
   speed further. 

5. As will be seen in the next section, the PES 
   scheme allows detection of parallelism 
   between the operations in different 
   statements. 


III. Parallel Processing of Statements 


It can be argued that expressions in most 
programs tend to be short and hence the scheme 
described above will not speed up the execution 
very much. Therefore, we need to go a step 
further to investigate how the PES scheme can be 
applied to exploit the parallelism among a block 
of statements. 


To execute two statements in parallel, a 
condition must be satisfied: the input set and 
the output set of either statement cannot have 
any variable in common with the output set of the 
other statement. There are two methods that we 
can use to exploit the parallelism between 
statements to make the PES scheme more practical. 
The first method is to make the statements 
independent of each other by using a technique 
called forward substitution [12]. After applying 
forward substitution, the expressions become 
independent of each other and usually become more 
complicated and have more parallelism to be 
exploited, so that we can use the PES scheme to 
execute the statements in parallel. 
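
The independence condition stated at the start of this paragraph can be written as a short check. The sketch below assumes, purely for illustration, that each statement is summarized as a pair of Python sets (input variables, output variables); the function name and the three sample statements are hypothetical and are not taken from the paper.

```python
def can_run_in_parallel(stmt_a, stmt_b):
    """True if neither statement reads or writes a variable the other writes."""
    (in_a, out_a), (in_b, out_b) = stmt_a, stmt_b
    return not ((in_a | out_a) & out_b) and not ((in_b | out_b) & out_a)


# Illustrative statements, each given as (input set, output set).
S1 = ({"B", "C"}, {"A"})   # A := B + C
S2 = ({"E", "F"}, {"D"})   # D := E * F
S3 = ({"A", "E"}, {"G"})   # G := A - E   (reads A, which S1 writes)

if __name__ == "__main__":
    print(can_run_in_parallel(S1, S2))
    print(can_run_in_parallel(S1, S3))
    print(can_run_in_parallel(S2, S3))
```

Two statements that merely read the same variable, such as S2 and S3 with E, remain independent under this condition; only a shared variable that at least one of them writes creates a dependency.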


In this paper we propose another method to 
detect the parallelism between the operations in 
different statements. Our approach is based on 
the concept introduced in Section II that an 
expression can be represented by one or more 
Parallel Execution Strings (PES's). The PES's of 
the statements in a block are tested for 
dependency and scheduled for execution according 
to the same condition mentioned above for 
statements, but here the tasks being tested and 
scheduled are the PES's generated from the 
statements, instead of the statements themselves. 
The advantage of this scheme is apparent in the 
example below. In the scheduling process, each 
PES will be assigned to a specific execution 
stage. Program execution will be done stage by 
stage. Those PES's which are assigned to the 
same stage can be executed in parallel. 


The following is an example for this scheme. 
The original program consists of three assignment 
statements: 

     X := A * B * C + D 
     X := C * X / (D + E * (F - G)) 
     A := D * E + C * B 

The PES's for these three statements are: 

     A * B * C + D → X 
     C * X / #1 → X 
     F - G * E + D /' #1 → X 
     D * E + #2 → A 
     C * B + #2 → A 

These PES's will be scheduled to two stages for 
execution: 

     Stage 1 
     A * B * C + D → X 
     F - G * E + D /' #1 
     C * B + #2 

     Stage 2 
     C * X / #1 → X 
     D * E + #2 → A 


From the example above we can see that this 
scheme has the advantage that, even though 
Statement 1 and Statement 2 are not independent, 
a subexpression of Statement 2 can be executed 
concurrently with Statement 1. A similar situation 
also exists between Statement 1 and Statement 3. 

IV. Compiling Algorithms 


In this section we will present two 
algorithms: Algorithm A is to convert an 
arithmetic expression into Parallel Execution 
Strings, and Algorithm B is to detect the 
parallelism between PES's in a block and to 
schedule them for execution. Algorithm A 
requires only one pass through the input 
expression and each PES corresponds to one of the 
execution paths described in Section II. 
Algorithm B is applied to the PES's in a block 
one by one. 


For simplicity, Algorithm A will 
assume that the input expressions have been 
translated into reverse Polish strings. 
However, the algorithm can be easily modified to 
accept arithmetic expressions without any 
preprocessing. Figure 2 is the flowchart of 
Algorithm A. Since scanning a reverse Polish 
string corresponds to the post-order traversal of 
the expression tree, a stack is used in Algorithm 
A to keep the operands before their operators are 
scanned. If the operand stored in the stack is 
an operator, it will be represented as $n, where 
n is the highest numbered PES passing through the 
operator node. 


When a variable or constant is scanned, it 
is always pushed onto the stack. When an 
operator is scanned, it will be handled according 
to its type. In Step 5 of Algorithm A, binary 
operators are processed in Case 1 through Case 3, 
for type 1 through type 3 nodes, respectively. 
For type 1 nodes (Case 1), a new PES is generated 
for the operation, and the operands on the stack 
are replaced by the string number. For type 2 
nodes (Case 2), the operator and the constant or 
variable operand are appended to each of the 
PES's passing through the operator node, and the 
operand is deleted from the stack. For type 3 
nodes (Case 3), the operator and a "#i" are 
appended to each of the PES's passing through the 
operator node, where i is a unique integer for 
the type 3 node, and the top two elements on the 
stack are replaced by the larger of the two. 
Unary operators are processed in Step 6. There 
are only two possible types for unary operators: 
type 1 and type 2. For type 1 nodes (Case 1), a 
new PES is generated for the operation and the 
operand on the stack is replaced by the string 
number. For type 2 nodes (Case 2), the operator 
is appended to each of the PES's passing through 
the operator node. No changes to the stack will 
be made. 


Algorithm A 

1. Convert the expression into a reverse Polish 
   string. This procedure can be found in 
   Hamblin [14] and is omitted here. Here we 
   may use the tree-height reduction 
   techniques [8-11] to obtain a modified 
   Polish string. 

2. Initialize i ← 1, j ← 1, where 
   i is an index for temporary storage and 
   j is an index for generated strings. 

3. From left to right scan the Polish string 
   for the next symbol S. 
   If it is the end of string, the procedure is 
   done, and String 1, String 2, ..., 
   String (j-1) are outputs. 

4. If the symbol S is an operand, push it onto 
   the stack, then go to Step 3. 

5. If the symbol S is a binary operator, the 
   top two elements on the stack have the 
   following possibilities: 

   Case 1 Both are operands: 
   1) Create a new String j. 
   2) Pop the top element off the stack, and 
      let it become the third symbol of 
      String j; then let the operator S be 
      the second symbol of String j, pop the 
      stack again, and make it the first 
      symbol of String j. 
   3) Push the symbol $j onto the stack. 
   4) j ← j+1. 
   End case 1 

   Case 2 One is an operand, and the other is 
   a $k: 
   Let e be the number such that $e is the 
   next $ appearing in the stack below 
   $k. If no such $e exists, e is 0. 
   1) If $k is on top of the stack and the 
      operator is not commutative, append the 
      "reverse operator" of S to each of: 
      String k, String (k-1), ..., 
      String (e+1). Otherwise, append the 
      operator S to each of the same set of 
      strings. (The "reverse operator" means 
      that the order of its operands is 
      reversed.) 
   2) Pop the stack twice. Append the one 
      which is an operand to the same set of 
      strings as in 1). 
   3) Push $k onto the stack. 
   End case 2 

   Case 3 Both of the top two elements on the 
   stack are $'s: 
   Let the top one be $n, the second one be 
   $m, and m < n. 
   Let e be the number such that $e is the 
   next $ appearing in the stack below 
   $m. If no such $e exists, e is 0. 
   1) Append the operator S to each of: 
      String m, String (m-1), ..., 
      String (e+1). 
   2) If the operator S is commutative, 
      append it to each of: String n, 
      String (n-1), ..., String (m+1). 
      Otherwise, append its reverse operator 
      to each of these strings. 
   3) Append a symbol #i to each of: 
      String n, String (n-1), ..., String m, 
      ..., String (e+1). 
   4) Pop the stack twice, and push $n 
      onto the stack. 
   5) i ← i+1. 
   End case 3 

   Go to Step 3. 

6. If the symbol S is a unary operator, there 
   are two possible cases: 

   Case 1 Top of the stack is an operand: 
   1) Create a new String j. 
   2) Pop the top element off the stack. Let 
      it become the first symbol of String j. 
   3) Let the operator S become the second 
      symbol of String j. 
   4) Push $j onto the stack. 
   5) j ← j+1. 
   End case 1 

   Case 2 Top of stack is $k: 
   Let e be the number such that $e is the 
   next $ appearing in the stack below 
   $k. If no such $e exists, e is 0. 
   Append the operator S to each of: 
   String k, String (k-1), ..., 
   String (e+1). 
   End case 2 

   Go to Step 3. 

End of Algorithm A 
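
A compact sketch of Steps 2 through 6 of Algorithm A follows; Step 1 (conversion to reverse Polish form) is assumed to have been done already. The data representation is an assumption for illustration: stack entries of the form `("$", k)` stand for the paper's `$k`, the operator `neg` stands for unary minus, and a primed token such as `/'` stands for a reverse operator. The function and constant names are hypothetical.

```python
COMMUTATIVE = {"+", "*"}
BINARY = {"+", "-", "*", "/"}
UNARY = {"neg"}


def algorithm_a(rpn):
    """Convert a reverse Polish token list into a list of PES's (token lists)."""
    strings = {}      # string number -> list of tokens
    stack = []        # plain operands, or ("$", k) markers
    i = j = 1         # i numbers the "#i" symbols, j numbers the strings

    def reverse_of(op):
        return op if op in COMMUTATIVE else op + "'"

    def e_below():
        # Number of the nearest $e still on the stack, or 0 if there is none.
        return next((x[1] for x in reversed(stack) if isinstance(x, tuple)), 0)

    for sym in rpn:
        if sym in BINARY:                                   # Step 5
            top, second = stack.pop(), stack.pop()
            top_is_op = isinstance(top, tuple)
            second_is_op = isinstance(second, tuple)
            if not top_is_op and not second_is_op:          # Case 1
                strings[j] = [second, sym, top]
                stack.append(("$", j))
                j += 1
            elif top_is_op and second_is_op:                # Case 3
                n, m, e = top[1], second[1], e_below()
                for s in range(e + 1, m + 1):
                    strings[s] += [sym, "#%d" % i]
                for s in range(m + 1, n + 1):
                    strings[s] += [reverse_of(sym), "#%d" % i]
                stack.append(("$", n))
                i += 1
            else:                                           # Case 2
                k = top[1] if top_is_op else second[1]
                operand = second if top_is_op else top
                op = reverse_of(sym) if top_is_op else sym
                for s in range(e_below() + 1, k + 1):
                    strings[s] += [op, operand]
                stack.append(("$", k))
        elif sym in UNARY:                                  # Step 6
            top = stack.pop()
            if isinstance(top, tuple):                      # Case 2
                for s in range(e_below() + 1, top[1] + 1):
                    strings[s].append(sym)
                stack.append(top)
            else:                                           # Case 1
                strings[j] = [top, sym]
                stack.append(("$", j))
                j += 1
        else:                                               # Step 4
            stack.append(sym)
    return [strings[s] for s in sorted(strings)]


# Reverse Polish form of -(A+G+B*C)/(D*(E+I)+F)+H
FIG1_RPN = "A G + B C * + neg D E I + * F + / H +".split()

if __name__ == "__main__":
    for string in algorithm_a(FIG1_RPN):
        print(" ".join(string))
```

Applied to the expression of Fig. 1, the sketch yields the same three strings as the figure, with `neg` in place of the unary minus sign.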


Figure 3 shows the flowchart of Algorithm B. 
Algorithm B will detect the parallelism across 
statements based on the PES scheme. It uses a 
symbol table to keep track of the variable names 
used in the block. For each variable in the 
symbol table, there are two fields associated 
with it: LAST-FETCHED and LAST-STORED. We will 
use LF(X) and LS(X) to denote the two fields 
associated with variable name X. These two 
fields contain the stage numbers at which a 
variable name was last fetched and last stored, 
respectively. Any PES's changing the variable X 
will be scheduled for a stage later than the 
larger of LF(X) and LS(X), because they cannot 
change the value of X until the prior fetching 
and storing operations of X are completed. Case 
1 in Step 1 is to handle this situation. Any 
PES's using the variable X as input will be 
scheduled for a stage later than LS(X), because 
they have to wait until the storing operation of 
X is completed. This situation is handled by 
Case 2 in Step 1. 

In Algorithm B, we use an array TMP whose 
size is at least the maximum number of temporary 
storage elements used in a block. TMP(k) keeps 
the largest stage number of the scheduled PES's 
that contain the symbol #k. TMP is used for 
eliminating unnecessary conflict checking and for 
assigning sub-expressions to the earliest stage 
possible. During the scanning of a PES, if we 
find any temporary storage symbol which has been 
scheduled before, there is no need to continue on 
the current PES. This situation is handled by 
Case 3 in Step 1. During the scanning of a PES, 
a variable STG is updated to the largest stage 
number in which variable conflicts will prohibit 
the execution of the current PES. When it comes 
to Step 2, the current PES has been determined 
not to be executed in or before stage STG. 
Therefore, the PES is scheduled for stage STG+1. 
In Step 3, the table entries of LF, LS, and TMP 
are updated to reflect the results of scheduling 
up to the current PES. 


Algorithm B 

0) Clear the TMP array. 
   Apply the following steps to each of the 
   PES's in the block one by one. 
   Set STG to 0. 

1) Scan the PES from left to right. If it 
   reaches the end of the PES, go to Step 2. 
   Otherwise, get the next symbol S. 
   If S is an operator, ignore it and get the 
   next symbol. 
   If it is an operand, there are three 
   possible cases: 

   Case 1 If S is the output variable of the 
   assignment: 
   1.1) STG ← max[STG, LF(S), LS(S)] 
   1.2) go to Step 1. 
   End case 1 

   Case 2 If S is an input variable of the 
   assignment statement: 
   1.3) STG ← max[STG, LS(S)] 
   1.4) go to Step 1. 
   End case 2 

   Case 3 If S is a temporary storage for 
   partial result, say #k: 
   If TMP(k) = 0, go to Step 1. 
   If TMP(k) > STG, go to Step 2. 
   If 0 < TMP(k) ≤ STG, then 
      S ← end of string, 
      go to Step 2. 
   End case 3 

2) STG ← STG+1. 

3) For each operand T to the left of S in the 
   string, do the following: 
   If T is the output variable of the 
   assignment, 
      LS(T) ← STG. 
   If T is an input variable of the 
   assignment, 
      LF(T) ← max[STG, LF(T)]. 
   If T is a temporary storage for the partial 
   result, say #j, 
      TMP(j) ← STG. 

End of Algorithm B 
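
The following sketch follows the steps of Algorithm B and applies them to the three-statement example of Section III. The representation is assumed for illustration only: each PES is a pair (token list, output variable), the tables LF, LS and TMP are dictionaries defaulting to 0, and the function returns, for each PES, its stage, the part of the string that is kept, and the output variable (or `None` when the string is cut at a `#k` already scheduled for a later stage). All names are hypothetical.

```python
from collections import defaultdict

OPERATORS = {"+", "-", "*", "/", "neg"}


def is_operator(token):
    return token.rstrip("'") in OPERATORS


def algorithm_b(block):
    """Assign each PES of a block to an execution stage."""
    lf, ls, tmp = defaultdict(int), defaultdict(int), defaultdict(int)
    schedule = []
    for tokens, output in block:
        stg = 0
        cut = None                      # position of the #k where the scan stopped
        for pos, sym in enumerate(tokens):          # Step 1
            if is_operator(sym):
                continue
            if sym.startswith("#"):                 # Case 3
                if tmp[sym] == 0:
                    continue
                if tmp[sym] > stg:
                    cut = pos                       # S stays at this #k
                break                               # otherwise S <- end of string
            stg = max(stg, ls[sym])                 # Case 2: input variable
        else:
            stg = max(stg, lf[output], ls[output])  # Case 1: output variable
        stg += 1                                    # Step 2
        scanned = tokens if cut is None else tokens[:cut]
        for t in scanned:                           # Step 3
            if is_operator(t):
                continue
            if t.startswith("#"):
                tmp[t] = stg
            else:
                lf[t] = max(stg, lf[t])
        if cut is None:
            ls[output] = stg
            schedule.append((stg, tokens, output))
        else:
            schedule.append((stg, tokens[:cut + 1], None))
    return schedule


EXAMPLE_BLOCK = [
    ("A * B * C + D".split(), "X"),
    ("C * X / #1".split(), "X"),
    ("F - G * E + D /' #1".split(), "X"),
    ("D * E + #2".split(), "A"),
    ("C * B + #2".split(), "A"),
]

if __name__ == "__main__":
    for stage, tokens, output in algorithm_b(EXAMPLE_BLOCK):
        target = "" if output is None else " -> " + output
        print("stage", stage, ":", " ".join(tokens) + target)
```

On the example block, the sketch places the first, third and fifth PES in stage 1 and the second and fourth in stage 2, which agrees with the two-stage schedule given in Section III.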


V. Machine Organization 


A possible machine organization for parallel 
processing of PES's is shown in Figure 4. In 
this system, there is a variable number of 
identical microprocessors. Each microprocessor 
has its own program counter PC, accumulator AC, 
busy bit indicator B, and ALU. It is capable of 
fetching, decoding, and executing the 
instructions stored in the main memory. At the 
completion of a non-branching instruction, the 
microprocessor increments its own program counter 
and starts executing the next instruction fetched 
from the main memory. The instruction set of 
each microprocessor will have a special 
"operate-or-store" type of instruction, which is 
an ordinary operator except that the execution 
will depend upon the condition of the operand. 
If the operand in the PR memory (to be explained 
below) shows a "not ready" condition, the 
execution will not proceed any further, the 
microprocessor will become free, and its B 
indicator will be reset to 0. Instead of 
performing the operation, it simply stores the 
content of its AC into the storage location 
addressed by the operand field of the 
instruction. If the operand is ready, the 
instruction will be performed as an ordinary 
operation. Instructions of this type are used 
for binary operators of which both operands are 
the results of some other operators, 
i.e. the type 3 nodes in the tree representation. 


There is a Partial-Result (PR) memory in the 
system. Its purpose is to store the partial 
result obtained by the microprocessor that 
reaches a type 3 node first. Each location in 
the PR memory has a status bit which indicates 
the availability of the partial result. The 
status bits are reset initially. The first 
"operate-or-store" instruction accessing a PR 
location will store its partial result and set 
the status bit. The second "operate-or-store" 
instruction accessing the same location will use 
its content and reset the status bit. While a 
microprocessor is executing an 
"operate-or-store" type instruction, no other 
microprocessors will be allowed to access the 
same location in the PR memory. This is to 
ensure that only one of the two operands for a 
type 3 node will be stored into the PR memory. 


In the system, there is an 
"Entry-Point-List" (EPL) memory, which consists 
of pointers pointing to the starting points of 
each PES. There is a pair of registers "Front" 
(F) and "Rear" (R), and an indicator 
"Need-Processor" (NP). F and R contain pointers 
to the EPL memory. Before an execution stage 
starts, F points to the beginning of the EPL for 
that stage, and R points to the beginning of the 
EPL for the next stage. F is incremented by one 
when a PES is taken by a processor for execution. 
The indicator NP indicates 1 when F is not equal 
to R, and 0 when F = R. Thus NP indicates 
whether or not the stage needs any more 
processors for execution. 


There is an FR memory to store the pointers 
that will be loaded into F and R throughout the 
execution. The "Central-Program-Counter" (CPC) 
register is a pointer pointing to the FR memory 
for the current execution stage. Changing the 
execution sequence is done by changing the 
content of the CPC register. 


In order to detect the completion of an 
execution stage, a "Number-of-Parallel-Strings" 
(NPS) register is used. When new values are loaded 
into F and R, NPS is set to a value equal to the 
difference between the values of R and F. It is 
then decremented at the completion of each PES. 
The completion of an execution stage is indicated 
by a zero in the NPS. 


The control sequence for each microprocessor 
can be summarized as follows: 
0. Idle. When NP = 1, go to 1. 
1. PC ← EPL(F), F ← F+1, B ← 1. 
2. Fetch, decode, and execute instructions. 
   Repeat until it encounters an IDLE 
   instruction or the operand of an 
   "operate-or-store" instruction is not ready. 
3. NPS ← NPS-1, B ← 0, go to 0. 
The control sequence for CPC, FR, F, R, and 
NPS can be summarized as follows: 
0. R ← FR(CPC), CPC ← CPC+1. 
1. F ← R, R ← FR(CPC), CPC ← CPC+1. 
2. NPS ← R-F. 
3. When NPS = 0, go to 1. 
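
The interplay of CPC, FR, F, R, NP and NPS can be traced with a small sequential sketch. It makes several simplifying assumptions that are not in the paper: the FR and EPL memories are Python lists, each PES taken from the EPL is treated as completing immediately (so NPS is decremented as soon as the entry is taken), and the sequence stops when the FR memory is exhausted. The function name and the sample addresses are hypothetical.

```python
def run_stages(fr, epl):
    """Yield, stage by stage, the EPL entries handed out to processors."""
    cpc = 0
    r = fr[cpc]                 # step 0: R <- FR(CPC), CPC <- CPC+1
    cpc += 1
    while cpc < len(fr):        # assumed stop condition: FR memory exhausted
        f, r = r, fr[cpc]       # step 1: F <- R, R <- FR(CPC), CPC <- CPC+1
        cpc += 1
        nps = r - f             # step 2: NPS <- R-F
        taken = []
        while f != r:           # NP = 1 as long as F differs from R
            taken.append(epl[f])    # a free processor loads PC <- EPL(F)
            f += 1                  # F <- F+1
            nps -= 1                # the PES completes: NPS <- NPS-1
        assert nps == 0         # step 3: the stage is complete when NPS = 0
        yield taken


if __name__ == "__main__":
    # Two stages: three PES entry points in the first, two in the second.
    fr_memory = [0, 3, 5]
    epl_memory = [100, 104, 110, 120, 126]
    for number, entries in enumerate(run_stages(fr_memory, epl_memory), start=1):
        print("stage", number, "entry points:", entries)
```

In the real organization the processors run concurrently and NPS reaches zero only after the slowest PES of the stage has finished or been given up; the sketch keeps only the bookkeeping of the pointers.
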

The design of a multi-microprocessor system 
using Am2900 bit-slice microprocessors can be 
found in [15]. The system has the capability of 
performing parallel operations with the PES 
approach, and the capability of multi-processing 
sequential programs. It also has an efficient 
and flexible interrupt handling mechanism. The 
memory system is also designed to match the high 
throughput of the multi-microprocessor system. 


VI. Code Generation 


For each expression, several sequences of 
instructions will be generated. Each sequence of 
instructions corresponds to one of the PES's 
generated by Algorithm A described above, which 
also corresponds to one path in the tree 
representation. The last instruction of each 
sequence is always an IDLE instruction, which 
will set the processor free. However, if there 
is only one sequence in each of two consecutive 
execution stages, the first sequence will not 
have the IDLE instruction at its end. This is to 
eliminate the overhead in rescheduling processors 
in the case that only one processor is needed in 
each of the two consecutive execution stages. 


For each sequence of instructions, an EPL 
entry, which is the address of the first 
instruction in the sequence, is generated in the 
EPL memory. For each execution stage, an FR 
entry, which is the address of the first EPL entry 
in the execution stage, is generated in the FR 
memory. Again, if there is only one sequence in 
each of two consecutive execution stages, there 
will be no FR entry generated for the second 
stage. This code generation scheme will ensure 
that a strictly sequential program can be 
executed by a multi-processor computer system as 
fast as by a uni-processor computer system, while 
the programs with parallelism exploited by the 
PES scheme will be executed by a multi-processor 
computer system faster than by a uni-processor 
computer system. 


The translation from the PES's to machine 
instructions is straightforward. It can be 
summarized as follows: 

1. The first symbol in the PES is always an 
   operand. Generate a LOAD instruction to 
   load that operand into the accumulator. 
   Continue scanning the PES from left to right 
   and do the following steps. 

2. If it is a unary operator, generate an 
   instruction to perform that operation on the 
   content of the accumulator. The result of 
   the operation will be in the accumulator. 

3. If it is a binary operator, generate an 
   instruction to perform the operation on the 
   next symbol. The operation has an implied 
   operand which is in the accumulator, and the 
   result is stored in the accumulator. 

4. In step 3, if the next symbol is a numerical 
   symbol preceded by a #, then the instruction 
   generated is a special "operate-or-store" 
   instruction. 


For example, the PES A + G * #1 / #2 + H will be 
translated into: 

     LDA  A     ;load A into accumulator 
     ADD  G     ;add G to accumulator 
     MOS  #1    ;multiply #1 to ACC or 
                ; store ACC to #1 then 
                ; idle 
     DOS  #2    ;divide ACC by #2 or 
                ; store ACC to #2 then 
                ; idle 
     ADD  H     ;add H to ACC 
     IDLE       ;end of PES, free 
                ; the processor 
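
The translation rules can be sketched as a short routine. Only the mnemonics LDA, ADD, MOS, DOS and IDLE appear in the paper's example; every other mnemonic in the tables below (SUB, MUL, DIV, AOS, SOS, NEG, and the `R` prefix for reverse operators) is a placeholder assumed here for illustration, as are the function name and the token-list form of the PES.

```python
# ADD is taken from the example above; SUB, MUL and DIV are assumed names.
PLAIN = {"+": "ADD", "-": "SUB", "*": "MUL", "/": "DIV"}
# MOS and DOS are taken from the example above; AOS and SOS are assumed names.
OPERATE_OR_STORE = {"+": "AOS", "-": "SOS", "*": "MOS", "/": "DOS"}
UNARY = {"neg": "NEG"}          # assumed mnemonic for unary minus


def generate_code(tokens):
    """Translate one PES (a list of tokens) into one-address instructions."""
    code = ["LDA " + tokens[0]]             # the first symbol is always an operand
    pos = 1
    while pos < len(tokens):
        op = tokens[pos]
        if op in UNARY:                     # unary: operate on the accumulator
            code.append(UNARY[op])
            pos += 1
            continue
        operand = tokens[pos + 1]           # binary: the next symbol is the operand
        table = OPERATE_OR_STORE if operand.startswith("#") else PLAIN
        mnemonic = table[op.rstrip("'")]
        if op.endswith("'"):                # reverse operator (assumed "R" prefix)
            mnemonic = "R" + mnemonic
        code.append(mnemonic + " " + operand)
        pos += 2
    code.append("IDLE")                     # free the processor at the end
    return code


if __name__ == "__main__":
    for instruction in generate_code("A + G * #1 / #2 + H".split()):
        print(instruction)
```

For the PES of the example, the routine emits the same six instructions as listed above. The omission of the final IDLE between two single-sequence stages, described earlier in this section, is not modeled.
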


VII. Code Optimization 


There are two types of code optimization 
that can be done with the proposed compiling 
scheme. The first one is to eliminate the 
redundant portion of a PES to save space. When 
the PES's generated from an expression are 
scheduled to be executed in more than one 
execution stage, the redundant portions of some 
PES's can be eliminated as follows. If a PES 
contains a partial result symbol (i.e. the symbol 
preceded by a "#") which also appears in some 
PES's scheduled for a later execution stage, then 
the substring to the right of that symbol can be 
eliminated. That substring is redundant because 
the execution of the PES will never proceed 
beyond that partial result symbol. 


The other type of optimization is to reorder the PES's to
minimize the total execution time. If the number of
processors is equal to or greater than the number of PES's in
an execution stage, reordering the PES's will not affect the
total execution time. But if there are more PES's than
processors, it might be advantageous to reorder the PES's.
The tasks to be scheduled are the PES's in an execution
stage. The execution time for each PES can be estimated by
adding up the execution times of all the operators in the
PES. The goal is to find an optimal non-preemptive
m-processor schedule for the PES's in an execution stage to
minimize the total execution time. However, it is known that
the problem of finding the optimal non-preemptive schedule
for n independent tasks of unequal length executed by m
processors is NP-complete even for m = 2 [16]. The problem is
complicated even more by the fact that reordering the PES's
will vary the execution time for each PES. Therefore, we can
only look for near-optimal scheduling strategies. A simple
and intuitively sound strategy for this problem is
longest-processing-time scheduling, which gives the tasks
with the longest processing time the highest priority.
Simulations based on randomly generated PES's show that the
longest-processing-time schedules have near-optimal
performance, even though the estimated execution time for a
PES is usually not its actual execution time.
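
The longest-processing-time rule can be illustrated with a
short sketch. This is not the authors' scheduler: it is a
minimal Python version that assumes the PES's of one stage
are independent and that each has a fixed estimated time. The
function name lpt_schedule and the sample times are
hypothetical.

```python
import heapq

def lpt_schedule(times, m):
    """Assign tasks to m processors, longest estimated time first.

    times: estimated execution time of each PES (illustrative numbers).
    Returns (makespan, assignment), where assignment[p] lists task indices.
    """
    # Visit tasks from longest to shortest estimated time.
    order = sorted(range(len(times)), key=lambda i: -times[i])
    # Min-heap of (current load, processor index).
    heap = [(0, p) for p in range(m)]
    heapq.heapify(heap)
    assignment = [[] for _ in range(m)]
    for i in order:
        load, p = heapq.heappop(heap)   # least loaded processor so far
        assignment[p].append(i)
        heapq.heappush(heap, (load + times[i], p))
    makespan = max(load for load, _ in heap)
    return makespan, assignment

# Six hypothetical PES's on two processors.
makespan, assignment = lpt_schedule([7, 5, 4, 3, 3, 2], 2)
```

The sketch ignores the complication noted above, namely that
reordering can itself change the execution time of a PES.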


VIII. Conclusion


In this paper, we have first proposed the PES scheme for
compiling the expressions in high-level language programs
into an intermediate form suitable for parallel processing.
We then went a step further to apply the scheme to a block of
statements to exploit more parallelism. The advantages of
this approach were briefly discussed. It should be noted that
the PES approach we have proposed does not preclude the use
of the tree-height reduction techniques and other techniques
of program analysis for parallel processing [12].


Note that the PES scheme can be applied to a system with a
conventional machine instruction set as well as to the
indirect-execution high-level language computers [17]. In the
latter, the PES notation is used as an intermediate language
which is ready for parallel execution by the hardware. The
translation from the source language to the intermediate
language is not complicated. The intermediate language
clearly indicates the parallelism exploited in the source
language programs.


We also presented the organization of a multi-microprocessor
system suitable for parallel processing of PES's. The code
generation and optimization techniques for such a system were
also discussed. The proposed machine organization and code
generation method have the advantage that they minimize the
overhead of executing a strictly sequential program or
program segment, while programs with parallelism can be
executed faster on such a multi-processor system than on a
uni-processor system. The optimization techniques described
here can be used with other machine-independent optimization
techniques in compiling programs.
Figure 2 Flowchart of Algorithm A


Figure 3 Flowchart of Algorithm B


Figure 4 Machine Organization
A FLOW ANALYSIS PROCEDURE FOR THE TRANSLATION
OF HIGH LEVEL LANGUAGES TO A DATA FLOW LANGUAGE* 


Stephen J. Allan** 
Arthur E. Oldehoeft 
Department of Computer Science 
Iowa State University 
Ames, Iowa 50011 


Abstract -- A data flow analysis procedure is described which
may be used in the translation of high level languages to
parallel target languages. The technique analyzes the data
dependencies which exist between statements in a high level
program and constructs an intermediate form amenable to
optimizing transformations and code generation. An example
illustrates how information provided by the analysis may be
used in generating code for a highly parallel data flow
machine. Within the framework of the described data flow
analysis procedure, extensions to the high level language are
discussed which allow for higher utilization of the data flow
machine.


Introduction 


The user acceptance of a data flow computer to some extent
will be influenced by the ability of the user to write
application programs in a high level language. This calls for
a translator to translate the high level language to a data
flow language. This paper describes a technique for data flow
analysis which may be used in this translation. The technique
is useful for a broad class of high level languages which
include the sequential von Neumann type high level languages
in common use as well as nonsequential high level languages
such as the so-called single assignment languages [1,6,7,16].
In order for this flow analysis technique to be applicable to
single assignment languages, it is required that the
definition of a value precede any use of that value in the
text of the high level program. In some single assignment
languages this is required by the definition of the language,
while others would require a preprocessor to topologically
order the statements.


The target language could be the instruction set provided by
any of a variety of parallel architectures. In this paper,
however, the assumed target machine is a highly parallel data
driven machine [5,9,15,17]. The underlying assumption behind
a data driven machine is that a program is not a sequence of
instructions that cause changes to a memory space, but
instead a program is a collection of computations related to
each other by the need for data values that are produced and
consumed. The order of execution of the computations is not
directly stated by the program but rather by the partial
ordering provided by the data dependencies. The purpose of
the data flow analysis is to determine this partial ordering
so that target code may be generated to exploit the inherent
parallelism in the program. While other techniques
[2,4,10,11] provide a general basis for the analysis,
additional information must be gathered to generate code for
such a parallel execution environment.


*Research reported herein was supported by the National
Science Foundation under NCS77-02467 and by the Sciences and
Humanities Research Institute of Iowa State University.

**Present address: Computer Science Department, Colorado
State University, Ft. Collins, Colorado 80523.


A compiler using the data flow procedure described in this
paper has been implemented to translate programs written in a
typical von Neumann type high level language to the language
of a simulated data flow machine [14]. The discussion of the
flow analysis technique is presented with this implementation
used as an example. Some extensions to the language are then
discussed which rely on this same technique of flow analysis
but allow higher utilization of a data flow machine.


High Level Language and Data Structure 
for Internal Representation 


A high level programming language, devoid of features
incompatible with the notion of functionality of programs,
was designed as a vehicle to help achieve the objectives of
the research [14]. The language is similar in appearance to
Pascal but, at the moment, has about the same expressive
power as Algol 60. A program consists of a main procedure
with declarations, including the definitions of other
procedures and functions, and a body of statements.
Procedures and functions may be recursively applied.
Statements in the language include assignment, conditional
(i.e., if-then and if-then-else), iterative (i.e., while-do,
repeat-until and for), procedure call, and input and output.
The for statement is translated directly into a while-do
statement and no further reference to it is made in the
paper. Parameters of procedure calls and definitions must
have an "in" and/or "out" directionality attribute. Integer,
real and boolean data types are currently supported along
with a full complement of operators and intrinsic functions
which can operate on identifiers declared with the above data
types. The array is the only data structuring facility which
exists at the present time. An array may have any number of
dimensions and may be dynamically declared at run time upon
procedure entry. Transfer of control (e.g., goto's) and
global references are not allowed in the language.


The compiler translates the source text of a program written
in the high level language into an intermediate form. This
intermediate form is recorded as a table of relatively high
level entries and is hereafter referred to as the IFT.
Initially, the IFT is a representation of the tree structure
of the high level program. After later phases add data flow
information, the IFT becomes a representation of a data flow
graph, amenable to optimizing transformations and code
generation.


Each entry in the IFT consists of four major fields as shown
in Figure 1.


   +------+---+---+------+
   | TYPE | I | O | TREE |
   +------+---+---+------+


Figure 1 Entry in the IFT


TYPE is the field that indicates the type of statement
represented by the entry; I is the set of input values for
the entry; O is the set of output values for the entry; and
TREE is the syntax tree for the entry, if one exists.


A separate IFT is generated for every procedure and function
defined in the program. Entries are created in the IFT during
the parse phase and are threaded to represent the ordering of
the statements as they were encountered in a sequential scan
of the high level program. Each high level statement results
in the generation of one or more entries in the IFT where the
data flow information is maintained. The general form of the
different high level statements and the types of entries in
the IFT generated by the compiler are given in Figure 2. A
simple high level statement (i.e., assignment, procedure or
function call, and procedure or function heading) generates
only one IFT entry in which the data flow information for
that statement will be maintained. For compound statements
that are conditionally executed (i.e., bodies of while,
repeat and if constructs) an "interface" entry is generated
to maintain the cumulative data flow information for the
condition and block of statements within the body. These
interface entries are denoted in Figure 2 by "if", "while"
and "repeat". An interface represents a staging area for the
values used by the condition and block and for the values
defined by the block. All information used by the block is
conceptually passed from preceding statements through the
interface and all information defined by the block is
conceptually passed to succeeding statements through the
interface. This allows for local flow analysis of blocks of
statements. A "close" or "end" entry is generated to mark the
end of a repeat, while, if, procedure or function block and
contains no data flow information. The "then" and "else"
entries are also generated to mark the start of the then and
else bodies. The input/output statement generates an
"input"/"output" entry in the IFT for each component in its
list. If an input or output component involves an implied do
loop, a while loop is generated with corresponding entries
placed in the IFT.


High level statement          Entries in the IFT
----------------------------  ----------------------------
procedure (function)          procedure (function)
  statement list              entries for statement list
end                           end

input/output a1,...,an        input/output for a1
                              ...
                              input/output for an

x := expression               assign

if condition                  if
then statement list1          condition
{else statement list2}        then
                              entries for statement list1
                              {else
                               entries for statement list2}
                              close

while condition               while
  statement list              condition
end                           entries for statement list
                              close

repeat                        repeat
  statement list              entries for statement list
until condition               condition
                              close

x(in(...),out(...))           call


Figure 2 High level statements and entries in the IFT

Figure 3(a) shows a segment of a high level program using the
Runge-Kutta method for finding the numerical solution to the
ordinary differential equation y'=x+y with y(0)=1. Figure
3(b) shows the corresponding set of entries in the IFT. The
actual data flow information for these entries will be
illustrated later. Figure 3(c) shows the syntax tree for
entry 8.


x := 0
y := 1
i := 1
repeat
   z := x+y
   k1 := h*z
   k2 := h*(z+h/2+k1/2)
   k3 := h*(z+h/2+k2/2)
   k4 := h*(z+h+k3)
   y := y+(1./6.)*(k1+2*k2+2*k3+k4)
   x := x+h
   i := i+1
until n < i

(a) High level program segment


     TYPE       TREE            TYPE       TREE
 0.  assign     T0          7.  assign     T6
 1.  assign     T1          8.  assign     T7
 2.  assign     T2          9.  assign     T8
 3.  repeat                10.  assign     T9
 4.  assign     T3         11.  assign     T10
 5.  assign     T4         12.  condition  T11
 6.  assign     T5         13.  close

(b) IFT entries


(c) Syntax tree T7 (for entry 8, k4 := h*(z+h+k3))


Figure 3 High level program segment and IFT entries

Data Flow Analysis


The data flow analysis technique presented here assumes the
IFT as an internal form of the program and also assumes that
the target machine provides direct semantic support for those
operations which are implicit in the high level program
(i.e., arithmetic operators, array selection and appendage,
procedure call). The general technique is a top-down
recursive descent flow analysis [10]. Since the IFT is a
highly structured representation of the program and since
procedures are free of side effects, the flow analysis is
highly simplified.


The total data flow analysis is performed in three phases. In
the first phase, the input and output sets for each statement
are collected. The second phase generates the use and
definition information about each value and the third phase
performs the live value analysis. The three phases are
outlined in subsequent sections and described in detail
elsewhere [3].
Collection of Input and Output Sets


This section describes the generation of input and output
sets for each type of entry in the IFT.


The calculation of the input set and the output set for a
non-interface IFT entry is straightforward, as illustrated in
Figure 4.


The input and output sets for interface entries for compound
blocks of statements (if, while or repeat) are somewhat more
complicated, depending on conditionally defined values and
values which are used in a block prior to their redefinition.


Entry Type     Input and Output Sets for Entry E0
-------------  ---------------------------------------------
assign         I(E0) = {x: x is referenced by the
                        assignment statement}
               O(E0) = {x: x is defined by the
                        assignment statement}
condition      I(E0) = {x: x is referenced by the condition}
               O(E0) = ∅ (null set)
input          I(E0) = {x: x is referenced by the
                        input statement} ∪ {input filename}
               O(E0) = {x: x is defined by the
                        input statement} ∪ {input filename}
output         I(E0) = {x: x is referenced by the
                        output statement} ∪ {output filename}
               O(E0) = {output filename}
call,          I(E0) = in(S)
function or    O(E0) = out(S)
procedure      where in(S) and out(S) are the sets of
               parameter values in the high level statement
               with the corresponding directionality
               attribute.


Figure 4 Calculation of input and output sets
for single IFT entry blocks


Let E = E1,...,En be any set of entries in the IFT
corresponding to a compound block of statements. Disregarding
conditionally defined values, this set of sequential entries,
E, has its input and output sets defined to be

$$I(E) = I(E_1) \cup \Big\{ \bigcup_{i=2}^{n} \Big( I(E_i) - \bigcup_{j=1}^{i-1} O(E_j) \Big) \Big\}$$

and

$$O(E) = \bigcup_{i=1}^{n} O(E_i).$$

This means that the input set for E consists of values which
are used before their redefinition within the corresponding
block of statements being processed and the output set
contains all values defined within the block.
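
As an illustration of these two formulas, the following
Python sketch accumulates the sets in a single forward pass.
It assumes each entry is given as a pair of sets (I, O); the
function name block_sets and this representation are
hypothetical, chosen only for the example. The data are the
entries of the repeat body in Figure 3.

```python
def block_sets(entries):
    """Input/output sets of a sequential block of IFT entries.

    entries: list of (I, O) pairs of Python sets, in program order.
    A value is an input of the block if some entry uses it before
    any earlier entry of the block has defined it.
    """
    block_in, defined = set(), set()
    for used, produced in entries:
        block_in |= used - defined   # uses not covered by earlier outputs
        defined |= produced          # running union of O(Ej) for j <= i
    return block_in, defined

# Body of the repeat loop in Figure 3 (entries 4 to 11).
body = [
    ({"x", "y"}, {"z"}),
    ({"h", "z"}, {"k1"}),
    ({"h", "z", "k1"}, {"k2"}),
    ({"h", "z", "k2"}, {"k3"}),
    ({"h", "z", "k3"}, {"k4"}),
    ({"y", "k1", "k2", "k3", "k4"}, {"y"}),
    ({"x", "h"}, {"x"}),
    ({"i"}, {"i"}),
]
I_E, O_E = block_sets(body)
```

Here I_E contains x, y, h and i, since each of these is read
in the body before the body assigns it.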


Suppose that x is conditionally defined in such a block and x
is used in some subsequent computation. The value for this
use of x may depend on its conditional definition or on some
previous definition. This is portrayed in Figure 5(a). In
order to simplify the data flow analysis, the previous
definition of x is added to the input set to denote
unconditional production of the most recent value of x
whether it comes from within the conditional block or from
the previous definition. This is portrayed in Figure 5(b).


                           Type     Input   Output
                                    Set     Set
x :=                       assign           x
if cond then x :=          if       x       x
z := x                     assign   x

(a) High level segment     (b) IFT entries


Figure 5 Conditional definition


The calculation of the input and output sets for interface
entries representing if, while and repeat blocks is given in
Figure 6. An input set contains the upward exposed uses of
values appearing in the block of statements constituting the
body along with values which are conditionally defined by the
block. The collection of this information is readily
implemented by a top-down recursive descent parse.


Interface entry E0 for   Input and Output Sets
block of the form
-----------------------  ------------------------------------
if C then E              I(E0) = I(C) ∪ I(E) ∪ O(E)
                         O(E0) = O(E)
if C then E1             I(E0) = I(C) ∪ I(E1) ∪ I(E2)
else E2                          ∪ {O(E0) - (O(E1) ∩ O(E2))}
                         O(E0) = O(E1) ∪ O(E2)
while C do E             I(E0) = I(C) ∪ I(E) ∪ O(E)
                         O(E0) = O(E)
repeat E until C         I(E0) = I(E) ∪ (I(C) - O(E))
                         O(E0) = O(E)


Figure 6 Calculation of input and output sets for
interface entries
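
The rules of Figure 6 translate directly into set
expressions. The sketch below is only an illustration in
Python: the function names are hypothetical, and the
arguments are assumed to be the already computed sets I(C),
I(E) and O(E) of the condition and the bodies.

```python
def if_then(I_C, I_E, O_E):
    # Values defined in the body are conditional, so they are also inputs.
    return I_C | I_E | O_E, set(O_E)

def if_then_else(I_C, I_E1, O_E1, I_E2, O_E2):
    out = O_E1 | O_E2
    # Values defined in only one branch need their previous definition.
    return I_C | I_E1 | I_E2 | (out - (O_E1 & O_E2)), out

def while_do(I_C, I_E, O_E):
    # The body may execute zero times, so its outputs are also inputs.
    return I_C | I_E | O_E, set(O_E)

def repeat_until(I_C, I_E, O_E):
    # The body always runs once before the condition is tested.
    return I_E | (I_C - O_E), set(O_E)

# repeat block of Figure 3: body sets as computed for entries 4 to 11.
body_in = {"x", "y", "h", "i"}
body_out = {"z", "k1", "k2", "k3", "k4", "y", "x", "i"}
I0, O0 = repeat_until({"i", "n"}, body_in, body_out)
```

For the repeat block this yields the input set x, i, y, h, n,
which agrees with the input set shown for entry 3 in Figure 11.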


Figure 7 shows the IFT entries for the Runge-Kutta program
segment (presented in Figure 3) expanded to include the input
and output sets.


Entry            Input Set          Output Set
 0.  assign      ∅                  x
 1.  assign      ∅                  y
 2.  assign      ∅                  i
 3.  repeat      x,i,y,h,n          z,k1,k2,k3,k4,y,i,x
 4.  assign      x,y                z
 5.  assign      h,z                k1
 6.  assign      h,z,k1             k2
 7.  assign      h,z,k2             k3
 8.  assign      h,z,k3             k4
 9.  assign      y,k1,k2,k3,k4      y
10.  assign      x,h                x
11.  assign      i                  i
12.  condition   i,n                ∅
13.  close

Figure 7 Input and output sets calculated for IFT


Generating the Use and Definition Sets 


After the input and output sets have been constructed, the
dependency relationships must be established between the
definition of values and their subsequent use. This is done
by matching names of values (names of identifiers in original
source code) in the corresponding input and output sets of
entries in the IFT. For all entries producing a value, a list
is constructed showing all the entries where that value is
used. Thus, use(x,Ei) denotes the set of entries at the same
nesting level as Ei which use the value of x defined in Ei.
For each value x used by an entry Ej, an ordered list
def(x,Ej), having maximum length of two, is constructed
showing the entries where the value was defined. The entry
which defines a value can be found by a backward scan of the
preceding entries until the value appears either in an output
set of an entry (corresponding to a statement at the same
level within the same block) or in the interface entry of the
enclosing statement. If it is not found, the value has no
definition. If x is used in Ej, then def(x,Ej) = (a,(b,c))
denotes the definition set of x. For a non-interface entry,
this set consists of only a, the first element. For an
interface entry, this set contains two elements a and (b,c).
The element a identifies where the value was most recently
defined outside the block and (b,c) identifies the last
definition within the block. Except for the case of an
if-then-else, c is null. A member of the set is denoted by
def(x,Ej)(u) where u is one of a, b, or c.


The use and definition analysis is presented 
in Figure 8 as a recursive top-down procedure 
which produces the use and def sets for the entire 
IFT. Suppose H denotes the interface entry for 
the block of statements to be analyzed and E 
denotes the set of entries corresponding to state- 
ments within the block. This procedure modifies 
the IFT entries by attaching the use and def sets. 


procedure useanddef(in(E,H), out(E,H))
  elseflag := false
  for i = 1 to |E| do
    if TYPE(Ei) = (else or then) then
      if (TYPE(Ei) = else) then
        elseflag := true
      end if
    else
      for each x ∈ I(Ei) do
        finddef(in(i,x,E,H), out(E,H))
      end for
      for each x ∈ O(Ei) do
        if x ∈ O(H) then
          if elseflag then
            def(x,H)(c) := Ei
          else def(x,H)(b) := Ei
          end if
        end if
      end for
      if TYPE(Ei) = (while or repeat or if) then
        U := {x: x is a subblock of Ei}
        useanddef(in(U,Ei), out(U,Ei))
      end if
      if TYPE(Ei) = (while or repeat) then
        for each x ∈ I(H) - O(H) do
          def(x,H)(b) := H
          use(x,H) := use(x,H) ∪ H
        end for
      end if
      if TYPE(Ei) = (while or repeat or if or
                     procedure or function) then
        for each x ∈ O(H) do
          if def(x,H)(b) ≠ ∅ then
            SB := def(x,H)(b)
            use(x,SB) := use(x,SB) ∪ H
          end if
          if def(x,H)(c) ≠ ∅ then
            SB := def(x,H)(c)
            use(x,SB) := use(x,SB) ∪ H
          end if
        end for
      end if
    end if
  end for
end procedure


procedure finddef(in(i,x,E,H), out(E,H))
  found := false
  for j = i-1 to 1 while not found do
    if x ∈ O(Ej) then
      def(x,Ei)(a) := Ej
      use(x,Ej) := use(x,Ej) ∪ Ei
      found := true
    end if
  end for
  if not found then
    if x ∈ I(H) then
      def(x,Ei)(a) := H
      use(x,H) := use(x,H) ∪ Ei
    else def(x,Ei)(a) := ∅
    end if
  end if
end procedure


Figure 8 Use and definition analysis procedure 
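
The backward scan performed by finddef can be tried out on a
single flat block. The Python sketch below is a simplified
illustration, not a transcription of Figure 8: it handles
only one nesting level, records only the first element a of
each definition list, and uses the string "H" to stand for
the enclosing interface entry. The names find_def and
use_and_def, and the (I, O) pair representation, are
hypothetical.

```python
from collections import defaultdict

def find_def(i, x, entries, iface_inputs):
    """Return where the value x used by entry i was defined.

    Scans entries i-1, i-2, ..., 0 of the same block. Returns the index
    of the defining entry, "H" if the value is supplied through the
    enclosing interface entry, or None if it has no definition.
    """
    for j in range(i - 1, -1, -1):
        if x in entries[j][1]:
            return j
    return "H" if x in iface_inputs else None

def use_and_def(entries, iface_inputs):
    """Build def and use tables for one flat block of (I, O) entries."""
    defs = {}
    uses = defaultdict(list)
    for i, (used, _) in enumerate(entries):
        for x in sorted(used):
            d = find_def(i, x, entries, iface_inputs)
            defs[(x, i)] = d
            if d is not None:
                uses[(x, d)].append(i)
    return defs, dict(uses)

# Body of the repeat loop in Figure 3; index 0 here is IFT entry 4.
body = [
    ({"x", "y"}, {"z"}),
    ({"h", "z"}, {"k1"}),
    ({"h", "z", "k1"}, {"k2"}),
    ({"h", "z", "k2"}, {"k3"}),
    ({"h", "z", "k3"}, {"k4"}),
    ({"y", "k1", "k2", "k3", "k4"}, {"y"}),
    ({"x", "h"}, {"x"}),
    ({"i"}, {"i"}),
]
defs, uses = use_and_def(body, {"x", "y", "h", "i", "n"})
```

With this indexing, the uses of h supplied by the interface
come out as body positions 1, 2, 3, 4 and 6, that is, IFT
entries 5, 6, 7, 8 and 10.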
Figure 9 shows the IFT entries for the Runge-Kutta program
segment (given in Figure 3) expanded to include the use and
definition information.


                 Input Set                   Output Set
Entry            val  def    use             val   use
 0.  assign      ∅                           x     3
 1.  assign      ∅                           y     3
 2.  assign      ∅                           i     3
 3.  repeat      x    0,10   4,10            x,k1,k2,k3,
                 i    2,11   11              k4,y,z,i   ?
                 y    1,9    4,9
                 h    ?      5,6,7,8,10
                 n    ?      12
 4.  assign      x    3                      z     3,5,6,7,8
                 y    3
 5.  assign      h    3                      k1    3,6,9
                 z    4
 6.  assign      h    3                      k2    3,7,9
                 z    4
                 k1   5
 7.  assign      h    3                      k3    3,8,9
                 z    4
                 k2   6
 8.  assign      h    3                      k4    3,9
                 z    4
                 k3   7
 9.  assign      y    3                      y     3
                 k1   5
                 k2   6
                 k3   7
                 k4   8
10.  assign      x    3                      x     3
                 h    3
11.  assign      i    3                      i     3,12
12.  condition   i    11                     ∅
                 n    3
13.  close


Figure 9 Use and definition for IFT


Live Value Analysis 


Live value analysis provides necessary infor- 
mation for certain optimizing transformations. A 
value is defined to be live at a given point in a 
program if it has a subsequent use. Live value 
analysis requires information gathered in the 
first two phases of data flow analysis. Associat- 
ed with each value x in the output set of entry Ei 
is a boolean value, live(x,Ei), which indicates 
whether x is live at this point. 


A top-down recursive descent algorithm called liveanalysis is
used to generate this information. The algorithm analyzes
values in the output set for each entry starting with the
first entry of a procedure. If an interface entry is
encountered, a recursive call on the procedure liveanalysis
is made (propagating known live information inward) to
analyze entries in the inner nesting level.


The algorithm for live value analysis is given in Figure 10.
This procedure modifies each value in the output set of the
IFT entries by attaching a boolean value. The initial call
would take the form liveanalysis(in(E,H), out(E,H)) where E
is the set of entries corresponding to a procedure H. Figure
11 shows the IFT for the Runge-Kutta program segment (given
in Figure 3) expanded to include the live value analysis
information.


procedure liveanalysis(in(E,H), out(E,H))
  for i = 1 to |E| do
    for each x ∈ O(Ei) do
      live(x,Ei) := false
      if use(x,Ei) ≠ ∅ then
        if use(x,Ei) = {H} then
          if TYPE(H) = (while or repeat)
             and x ∈ I(H) then
            live(x,Ei) := true
          else
            if TYPE(H) = (procedure
                          or function) then
              live(x,Ei) := true
            else live(x,Ei) := live(x,H)
            end if
          end if
        else live(x,Ei) := true
        end if
      end if
    end for
    if TYPE(Ei) = (while or repeat or if) then
      U := {x: x is a subblock of Ei}
      liveanalysis(in(U,Ei), out(U,Ei))
    end if
  end for
end procedure


Figure 10 liveanalysis procedure 


                 Input Set         Output Set
Entry            val               val                   live
 0.  assign      ∅                 x                     true
 1.  assign      ∅                 y                     true
 2.  assign      ∅                 i                     true
 3.  repeat      x,i,y,h,n         z,k1,k2,k3,k4,y,i,x   ?
 4.  assign      x,y               z                     true
 5.  assign      h,z               k1                    true
 6.  assign      h,z,k1            k2                    true
 7.  assign      h,z,k2            k3                    true
 8.  assign      h,z,k3            k4                    true
 9.  assign      y,k1,k2,k3,k4     y                     true
10.  assign      x,h               x                     true
11.  assign      i                 i                     true
12.  condition   i,n               ∅
13.  close


Figure 11 Live values for the IFT
Language Extensions for the Exposure
of Parallelism


In this section, extensions to the high level language are
discussed which allow for higher utilization of a data flow
machine. The concepts of the "forall" statement [1], the
"stream" data type [13,17] and array to scalar functions are
introduced to allow more efficient execution. The flow
analysis described in the previous section remains basically
the same.


The forall statement, depending on its implementation, allows
for significant reduction in the order of the computation.
The intent of the forall is that the invocations of the body
are independent so that, in theory, all may execute in
parallel. The syntax of the forall and the corresponding IFT
entries are shown in Figure 12.


High level statement          Entries in the IFT
----------------------------  ----------------------------
forall forall_cond do         forall
  statement list              forall condition
end                           entries for statement list
                              close


Figure 12 Forall statement and IFT entries


The input and output sets are calculated in the same manner
as for the iterative for statement. It is assumed that the
body of the forall statement obeys the single assignment rule
which states that a value may be assigned only once during
the execution of the program. Thus, any value used on a right
hand side within the body of the forall must be computed
outside the forall statement. The only value that can be
output from a forall statement is an array.


The implementation of the forall statement is 
dependent on the underlying data flow architec- 
ture. Possible implementations include unwinding 
of loops by the architecture [5] or by recursion 
or the use of special hardware functions such as 
compose and decompose [15]. 


Loop decomposition, described by Lo [12] and extended by
Allan [3], can be used as an optimization technique at
compile time to transform some iterative statements to forall
statements. Every value used in right context in the body of
the loop is examined to determine if it depends on a value
computed in a previous iteration. If the iterations are found
to be independent, the loop is immediately transformed into a
forall statement. Otherwise, attempts are made to break these
data dependencies through forward substitution, saving old
values of an array in a temporary array, making scalars into
arrays and rearranging the code (maintaining the precedence
relations that previously existed). If the iterations are
still dependent, the loop is decomposed into smaller loops by
finding a partition of the statements in the loop such that
the precedence relations between the statements are still
preserved. If the partitioning is successful, each partition
is treated as a single loop and the process is repeated.
Loops which cannot be partitioned are executed in the
original iterative manner. Figure 13 illustrates a simple
decomposition of a loop.


i := 1;                        forall i in (1,n) do
while i <= n do                  b'(i) := b(i)
  b(i) := a(i) + c(i+1)        end;
  c(i) := b(i+1)               forall i in (1,n) do
end                              b(i) := a(i) + c(i+1)
                               end;
                               forall i in (1,n) do
                                 c(i) := b'(i+1)
                               end

(a) before                     (b) after


Figure 13 Example of loop decomposition
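
The effect of the decomposition in Figure 13 can be checked
on small data. The Python sketch below is an illustration
under stated assumptions, not the compiler's transformation:
the loop increment, which the figure does not show, is
written out explicitly; lists are indexed from 1 with slot 0
unused; and the snapshot of b copies the whole list so that
position n+1 is available. The function names and sample
values are hypothetical.

```python
def sequential(a, b, c, n):
    """Loop of Figure 13(a), with the increment of i written out."""
    b, c = list(b), list(c)
    i = 1
    while i <= n:
        b[i] = a[i] + c[i + 1]
        c[i] = b[i + 1]
        i += 1
    return b, c

def decomposed(a, b, c, n):
    """Three passes in the spirit of Figure 13(b)."""
    b_old = list(b)               # plays the role of b'; whole list copied
    new_b, new_c = list(b), list(c)
    for i in range(1, n + 1):     # iterations read only the old c
        new_b[i] = a[i] + c[i + 1]
    for i in range(1, n + 1):     # iterations read only the snapshot b_old
        new_c[i] = b_old[i + 1]
    return new_b, new_c

n = 4
a = [0, 1, 2, 3, 4, 0]
b = [0, 10, 20, 30, 40, 50]
c = [0, 5, 6, 7, 8, 9]
same = sequential(a, b, c, n) == decomposed(a, b, c, n)
```

Each pass of decomposed reads only values fixed before the
pass begins, which is what makes its iterations independent.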


As a second technique, streams appear to offer some
advantages. When a forall is not applicable, streams might
still be used to reduce the coefficient of the order of the
computation through pipelining sections of the data flow
program. Streams, combined with recursion, can result in a
reduction in the order of the computation.


A third technique for the higher utilization 
of a data flow machine is the introduction of cer- 
tain functions (e.g., sum, product) which map an 
array or stream to a scalar value. A recursive 
implementation may be used to reduce the order of 
a computation from O(n) to O(log2 n), where n is 
the length of the array or stream. 
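
A recursive halving scheme of this kind can be sketched as
follows. This is a sequential Python illustration, with the
hypothetical name tree_sum; it returns the depth of the
addition tree alongside the total, since the depth is what
bounds the running time when the two halves are evaluated in
parallel.

```python
def tree_sum(values):
    """Sum a non-empty list by recursive halving.

    Returns (total, depth), where depth is the height of the tree of
    additions, i.e. the number of steps if both halves run in parallel.
    """
    if not values:
        raise ValueError("tree_sum needs at least one value")
    if len(values) == 1:
        return values[0], 0
    mid = len(values) // 2
    left, depth_left = tree_sum(values[:mid])
    right, depth_right = tree_sum(values[mid:])
    return left + right, 1 + max(depth_left, depth_right)

total, depth = tree_sum(list(range(1, 9)))   # eight values
```

The number of additions is still n-1; only the length of the
longest chain of dependent additions drops to about log2 n.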


Code Generation 


This section illustrates how the information 
provided by the data flow analysis may be used in 
generating code for a highly parallel data flow 
machine. 


The data flow program can be viewed as a 
directed graph consisting of nodes and edges [8]. 
Figure 14 shows a data flow graph for the Runge- 
Kutta program. A node represents a base language 
operation and an edge represents a data dependency 
between nodes. The normal firing rules allow a 
node to execute whenever there is a value on each 
of its input edges and no tokens on any of its 
output edges. The value produced by the node is 
placed on each of the output edges. 
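

The normal firing rule can be sketched as a small
interpreter. This is a hypothetical Python illustration, not
the simulated machine of [14]: edges are named by strings,
each edge holds at most one token, and a dictionary maps an
edge name to the value currently on it. The class Node and
the functions enabled, fire and run are assumptions made for
the example.

```python
import operator

class Node:
    """A base language operation with named input and output edges."""
    def __init__(self, op, inputs, outputs):
        self.op, self.inputs, self.outputs = op, inputs, outputs

def enabled(node, tokens):
    # A value on every input edge and no token on any output edge.
    return (all(e in tokens for e in node.inputs)
            and all(e not in tokens for e in node.outputs))

def fire(node, tokens):
    args = [tokens.pop(e) for e in node.inputs]   # consume input tokens
    value = node.op(*args)
    for e in node.outputs:                        # place result on outputs
        tokens[e] = value

def run(nodes, tokens):
    """Fire enabled nodes until none remains enabled."""
    progress = True
    while progress:
        progress = False
        for node in nodes:
            if enabled(node, tokens):
                fire(node, tokens)
                progress = True
    return tokens

# z := x+y followed by k1 := h*z; nodes listed in reverse on purpose.
nodes = [Node(operator.mul, ["h", "z"], ["k1"]),
         Node(operator.add, ["x", "y"], ["z"])]
result = run(nodes, {"x": 0.0, "y": 1.0, "h": 0.1})
```

The order in which the nodes are listed does not matter: the
multiplication cannot fire until the addition has placed a
value on edge z.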


Special firing rules exist for merge gate and true and false
gate operations, which support conditional execution. A merge
gate takes a data value from its T input edge or F input edge
(denoted by open arrow heads) depending on a boolean control
value present on its control edge (denoted by a closed arrow
head). A true gate allows a data value to be passed from its
input edge (open arrow head) to its output edge if a true
boolean control value is received on its control edge (closed
arrow head). The data value is destroyed if a false boolean
control value is received. Analogous firing rules hold for a
false gate. Boolean control values are produced by relational
nodes.


Figure 14 Data flow graph of Runge-Kutta program


The target language generated by the compiler is a set of
instructions for the data flow machine, which is simply a
linear representation of the data flow graph. Following the
flow analysis described in the previous section, it is
conceptually easy to generate the data flow graph. For
entries in the IFT, inter-entry dependencies have been
established by the use and definition analysis. Intra-entry
dependencies are established according to the syntax tree
(TREE) of the IFT entry. Generalized code templates for three
constructs, found in conventional high level languages,
appear in Figure 15. In each of the graphs, the edges labeled
IN or OUT indicate the sets of data values that pass into and
out of the specified construct. The set of values represented
by the IN edge can be found in the input set of the interface
entry and the set of values represented by the OUT edge can
be found in the output set of the same entry. Figure 14
illustrates the details for a repeat-until construct.


Conclusions 


This paper has presented a data flow analysis procedure which
is useful in the translation of high level languages to the
machine language for a highly parallel data flow processor.
This technique could be used for a variety of parallel
architectures and is similar to flow analysis techniques used
for code optimization on a conventional machine. On one hand,
the algorithm presented in this paper is generally simpler
than other techniques due to the enforcement of structured
programming constructs and the elimination of side effects.
On the other hand, the algorithm maintains more data flow
information than do other techniques since the primary
purpose of the analysis is to generate code for a data flow
machine.


A compiler has been implemented and is fully 
operational using this technique in the generation 
of code for execution on a simulated data flow 
machine. 


(a) if-then-else    (b) while-do    (c) repeat-until


Figure 15 Generalized code templates
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AN ABSTRACT IMPLEMENTATION 


FOR 


CONCURRENT COMPUTATION WITH STREAMS®) 


Jack B. Dennis 
Ken K.-S. Weng 
Laboratory for Computer Science 
Massachusetts Institute of Technology 
Cambridge, Massachusetts 02139 


Abstract -- This paper is a contribution toward 
developing practical general-purpose computer systems 
embodying data flow principles. We outline a hardware 
structure capable of high concurrency and present an 
abstract model of data flow program execution which could 
be implemented within the proposed hardware structure. 
Our abstract model supports a user programming language 
that includes recursive function modules and provides 
streams of values for inter-module communication. 


Introduction 


We present here a conceptual model of program 
execution that can serve as the functional specification for 
a distributed or highly concurrent computer system based on 
data flow principles. The programming language supported 
by our conceptual model or "abstract implementation" is an 
applicative or value-oriented language that includes 
streams of values as a basic programming tool. Streams are 
attractive because use of streams for communication 
between program modules leads to programs whose modules 
have functional semantics and whose overall meaning can 
be expressed as functional components combined using 
composition and a fixpoint operator [12], thus avoiding use 
of side effects. In the present discussion we only consider 
determinate programs. The extension of this work to 
nondeterminate computation is a subject of current 
research. 


Specifically, we introduce a value-oriented language 
and discuss representation of its semantics by translation 
into recursive data flow schemas [9]. We sketch an 
operational semantics (formal interpreter) for these data 
flow schemas and outline the structure of a hardware 
system capable of highly concurrent execution of 
value-oriented programs. A more detailed and complete 
presentation of this work is given in the thesis of Weng [17]. 


(a) This research was supported in part by the National 
Science Foundation under research grant MCS75-04060 
A01 and in part by the Lawrence Livermore Laboratory of 
the University of California under contract 8545403. 




A Simple Value-Oriented Language 


Our textual language departs from conventional 
languages in several ways. There is no notion of sequential 
control flow and there are no explicit primitives for 
introducing parallelism. The concurrency of a computation is 
determined by the data dependency within the program 
rather than by explicit creation of concurrent processes. 


The language is value-oriented in the sense that each 
syntactic unit defines a mathematical function that maps 
input values into result values: there are no side effects or 
other spurious interactions in the evaluation of expressions. 


The language does not have the notion of memory 
locations or variables commonly found in conventional 
sequential programming languages; instead names are used 
to denote values defined by expressions in much the same 
way as in mathematics. With value-oriented semantics, it is 
natural to write programs in a form that exhibits the inherent 
concurrency of an algorithm. The data types of the 
language (a) are integer, real, boolean, character-string, 
structure, and procedure. We shall call these data types 
simple data types. The operations for types integer, real, 
boolean, and character-string are the usual operations and 
need no comment. The operations for values of type 
structure are defined below. The only operation for 
procedure values is procedure application. 


The syntax of the language is given in Fig. 1. A 
procedure consists of a set of procedure definitions 
followed by an expression. A procedure definition is of the 
form 


P = procedure ( a_1: T_1, ..., a_m: T_m ) yields R_1, ..., R_n; 
        <procedure def> 
            ... 
        <procedure def> 
        <expression> 
    end P; 


(a) The language described here is closely related to the 
language called VAL in development at MIT [3]. 


Notation:   {<E>}* means <E> | <E>, {<E>}* 
            {<E>}  means {<E>}* | empty 


< program > ::= program { < procedure def > } < expression > end 
< procedure def > ::= < name > = procedure ( < input list > ) 
                          yield < output list >; 
                          { < procedure def > }; < expression > 
                      end < name > 

< input list > ::= { < type declaration > } 
< type declaration > ::= < name > : < type > 
< output list > ::= { < type > } 

< expression > ::= < primitive expression > 
                 | { < expression > }* 
                 | < let-block expression > 
                 | < conditional expression > 
                 | < application expression > 

< let-block expression > ::= 
        let { < type declaration > }; { < name def > }; in < expression > end 

< name def > ::= { < name > } = < expression > 

< conditional expression > ::= 
        if < expression > then < expression > else < expression > end 

< application expression > ::= < name > ( < expression > ) 

< primitive expression > ::= 
        < expression > < primitive operation > < expression > 
      | < primitive operation > ( < expression > ) 
      | < name > 
      | < constant > 

< simple data type > ::= integer | real | boolean | character-string | structure 
< type > ::= < simple data type > | stream of < simple data type > 


Figure 1. Syntax of the language 


This defines a procedure P that requires m input values 
a_1, ..., a_m of types T_1, ..., T_m respectively. The names 
a_1, ..., a_m must be distinct and can appear free in 
<expression>. The evaluation of the procedure yields an 
ordered set of values of types R_1, ..., R_n resulting from 
<expression>. 


Each expression denotes an ordered set (n-tuple) of 
values whose arity is n. We give a recursive definition of 
the arity A(E) of each of the five types of expressions as 
follows: 


A( <primitive expression> ) = 1 

A( <exp_1>, ..., <exp_k> ) 
    = A( <exp_1> ) + ... + A( <exp_k> ) 

A( <let-block expression> ) 
    = A( let <definitions> in <exp> end ) 
    = A( <exp> ) 

A( <conditional expression> ) 
    = A( if <exp_1> then <exp_2> else <exp_3> end ) 
    = A( <exp_2> ) 
    = A( <exp_3> ) 

A( <procedure application> ) 
    = A( <name> ( <expression> ) ) 
    = the number of elements in the <output list> 
      of procedure <name>. 


For a <procedure def> to be correct, the arity of the 
expression which is its body must match the number of 
result types specified in its <output list>. 
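
The recursive definition of arity can be restated as a short program. The following Python sketch is ours and is not part of the language definition: the node classes (Prim, Seq, Let, If, Apply) are hypothetical names for the five kinds of expression, and the table `outputs`, mapping a procedure name to the length of its output list, is an assumed stand-in for the procedure headings.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


# Hypothetical syntax-tree nodes, one per kind of expression.
@dataclass
class Prim:                      # primitive expression
    text: str


@dataclass
class Seq:                       # <exp_1>, ..., <exp_k>
    items: Tuple[object, ...]


@dataclass
class Let:                       # let <definitions> in <exp> end
    defs: Tuple[object, ...]
    body: object


@dataclass
class If:                        # if <exp_1> then <exp_2> else <exp_3> end
    cond: object
    then: object
    other: object


@dataclass
class Apply:                     # <name> ( <expression> )
    name: str
    arg: object


def arity(e, outputs: Dict[str, int]) -> int:
    """Return A(e); `outputs` gives the output-list length of each procedure."""
    if isinstance(e, Prim):
        return 1
    if isinstance(e, Seq):
        return sum(arity(x, outputs) for x in e.items)
    if isinstance(e, Let):
        return arity(e.body, outputs)
    if isinstance(e, If):
        a, b = arity(e.then, outputs), arity(e.other, outputs)
        if a != b:
            raise ValueError("both arms of a conditional must have the same arity")
        return a
    if isinstance(e, Apply):
        return outputs[e.name]
    raise TypeError("unknown expression node")
```

With such a function, the correctness condition for a procedure definition amounts to comparing `arity(body, outputs)` with the number of result types in its heading.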


Often it is convenient to introduce names for 
expressions because they are common subexpressions of 
larger expressions. The let-block expression is used for 
introducing names such that each name stands for an 
expression of arity one. A let-block expression is of the 
form: 


let { <type declaration> }; 
    <name-list_1> = <exp_1>; 
        ... 
    <name-list_k> = <exp_k>; 
in <exp> end; 

The names in type declarations of a let-block are local 
names meaningful only within the block; these names must 
be distinct from each other and may appear free in 
<exp_1>, ..., <exp_k>, and <exp>. Name conflicts in nested 
let-blocks are resolved by the scope rule that inner 
definitions take precedence over outer definitions. 


We require that the number of names in a name-list be 
equal to the arity of the expression to the right of the 
equality sign. The value of a name in a name-list is the 
value of the corresponding expression appearing on the 
right hand side of the equal sign, and must be of the type 
specified by the type declaration. The value of a let-block 
expression is the value of <exp>. 


A conditional expression is of the form: 
    if <exp_1> then <exp_2> else <exp_3> end; 


The expression <exp_1> is a boolean value of arity one. The 
expressions <exp_2> and <exp_3> have the same arity and 
the corresponding value in each expression must be of the 
same type. The value of a conditional expression is the 
value of <exp_2> if <exp_1> is the boolean value true; 
otherwise it is the value of <exp_3>. 


A procedure application expression is of the form: 
P( <exp> ); 


where the expression <exp> has the same arity as the 
number of input values required by the procedure P and the 
type of each value matches that of the input specification. 
The result of the procedure application is an expression of 
the arity and types defined by the yield clause of the 
procedure heading. 


As a simple example of a program in our value-oriented 
language, Fig. 2 shows a procedure that defines a parallel 
computation of the factorial function. 


Factorial = procedure ( n : integer ) 
        yields integer; 

    Product = procedure ( n1 : integer, n2 : integer ) 
            yields integer; 
        if n2 <= n1 then n1 
        else let middle : integer; 
                 middle = (n1 + n2) quotient 2; 
             in Product( n1, middle ) 
                * Product( middle+1, n2 ) end 
    end Product; 

    if n < 0 then error else Product( 1, n ) end; 
end Factorial; 


Figure 2. An Example Program 


Data Structures 


For the purpose of the present exposition, we will 
introduce a simple but very general data structure type. A 
data structure can be either nil which denotes the structure 
having no components, or a structure having n component 
values v_1, ..., v_n whose selector names are respectively 
s_1, ..., s_n. The selectors are either character strings or 
integers and each selector name must be different from all 
others in the same data structure. We represent such a 
structure value by the notation 

    (s_1: v_1, ..., s_n: v_n). 

The operations on data structures are defined below, where 
d and d' are data structures, s is a selector name, and c is 
a value of any type: 


(1) create ( ) 
The create operation yields the nil data structure. 

(2) append (d, s, c) 
The result is a data structure d' which is identical 
to d except that the s component is c regardless of 
whether d already contains a component with 
selector name s. 

(3) delete (d, s) 
The result is a data structure d' which does not 
have an s component. 

(4) select (d, s) 
If d has an s component, the result is the value of 
that component. Otherwise, the result is the value 
undefined. 

(5) nil-structure (d) 
This is a predicate whose value is true if d is nil; 
otherwise its value is false. 


Notice that the effects of 
    delete (d, s) 
and 
    append (d, s, nil) 
are different, since the delete operation would remove 
the component (s, d') while the append operation would 
replace it with (s, nil). It should be mentioned that an array 
is simply a data structure whose selector names are all 
integers. 


reverse = procedure ( x : structure ) 
        yields structure; 
    if nil-structure ( x ) then x else 
        let left, right : structure; 
            left = reverse( select( x, "r" ) ); 
            right = reverse( select( x, "l" ) ); 
        in append( append 
            ( create( ), "l", left ), "r", right ) 
        end 
end reverse; 


Figure 3. reverse 


The data structure operations are illustrated by the 
recursive procedure "reverse" in Fig. 3, which interchanges 
the role of selector names l and r in a given data structure 
of arbitrary depth. 
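
The value-oriented character of these operations may be easier to see in executable form. The Python sketch below is an illustration only, under our own assumptions: structures are modeled as dictionaries that are never modified in place, nil is the empty dictionary, and the marker UNDEFINED stands for the value "undefined". As in Fig. 3, "reverse" assumes every non-nil node carries both an "l" and an "r" component.

```python
UNDEFINED = object()  # assumed stand-in for the value "undefined"


def create():
    """Yield the nil data structure."""
    return {}


def append(d, s, c):
    """Return a copy of d whose s component is c; d itself is unchanged."""
    new = dict(d)
    new[s] = c
    return new


def delete(d, s):
    """Return a copy of d that has no s component."""
    return {k: v for k, v in d.items() if k != s}


def select(d, s):
    """Return the s component of d, or UNDEFINED if there is none."""
    return d.get(s, UNDEFINED)


def nil_structure(d):
    """Predicate: true exactly when d is nil."""
    return d == {}


def reverse(x):
    """Interchange the roles of selectors "l" and "r" at every level."""
    if nil_structure(x):
        return x
    left = reverse(select(x, "r"))
    right = reverse(select(x, "l"))
    return append(append(create(), "l", left), "r", right)
```

Because every operation returns a new value, the two recursive calls in "reverse" share no state and could proceed concurrently.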


Streams 


A stream is a sequence of values, all of the same type, 
that are passed in succession, one-at-a-time between 
program modules. 


The use of streams of data in programming is an 
alternative way of expressing computations that have 
conventionally been expressed as coroutines or a set of 
cooperating processes. For example, a compiler may be 
organized into phases which are implemented as a set of 
coroutines [6]. 


The operations on values of type stream of T are 
defined below where s and s' are streams, and c is a value 
of type T. 


(1) [ ] 
    The result is the empty stream which is the 
    sequence of length zero. 
(2) cons (c, s ) 
    The result is a stream s' whose first element is c 
    and whose remaining elements are the elements of 
    the stream s. 
(3) first (s ) 
    The result is the value c which is the first element 
    of s. If s is empty, the result is undefined. 
(4) rest (s ) 
    The result is the stream left after removing the first 
    element of s. If s = [ ], the result is undefined. 
(5) empty (s ) 
    The result is true if s = [ ], and is false otherwise. 


prime_generator = procedure ( n : integer ) 
        yields stream of integer; 

    generate = procedure ( i, n : integer ) 
            yields stream of integer; 
        if i > n then [ ] 
        else cons ( i, generate( i+1, n ) ) end; 
    end generate; 

    sieve = procedure ( s : stream of integer ) 
            yields stream of integer; 
        if empty ( s ) then [ ] 
        else let x : integer, 
                 s2, s3 : stream of integer; 
                 x, s2 = first ( s ), rest ( s ); 
                 s3 = delete ( x, s2 ); 
             in cons ( x, sieve( s3 ) ) end; 
    end sieve; 

    delete = procedure ( x : integer, 
                         s : stream of integer ) 
            yields stream of integer; 
        if empty ( s ) then [ ] 
        else let y : integer, 
                 s2, s3 : stream of integer; 
                 y, s2 = first ( s ), rest ( s ); 
                 s3 = delete ( x, s2 ); 
             in if divide ( x, y ) then s3 
                else cons ( y, s3 ) end; 
    end delete; 

    sieve ( generate ( 2, n ) ); 
end prime_generator; 


Figure 4. A Prime Number Generator 


The following identity is satisfied by the stream operations: 


    if empty( s ) then s = [ ] 
    else s = cons( first( s ), rest( s ) ) 


The problem of generating all prime numbers less than 
a given integer n is a good example of the use of streams in 
constructing a modular program so as to expose many 
independent actions for concurrent execution. The sieve of 
Eratosthenes expressed in our textual language is 
presented in Fig. 4. The procedure "generate" produces 
the sequence of successive integers beginning with 2. This 
stream is processed by "sieve" to remove nonprime 
elements. Procedure "sieve" operates by taking the first 
element of its input and removing all multiples of the first 
element (using "delete") and applying "sieve" recursively to 
the remaining elements. (The first use of stream concepts 
for the prime number sieve, as far as we know, was in [16]. 
It seems the example has been discovered independently 
by several authors.) 
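
For readers who wish to experiment, the sieve can be transcribed almost line for line into Python. This is a sketch under stated assumptions rather than the implementation discussed later: a stream is modeled as a nested pair (first, rest) with None as the empty stream, "generate" is assumed to stop once i exceeds n, divide(x, y) is read as "x divides y", and evaluation is strictly sequential, so none of the concurrency of the data flow model is exhibited. The helper `to_list` is ours.

```python
EMPTY = None  # assumed representation of the empty stream [ ]


def cons(c, s):
    return (c, s)


def empty(s):
    return s is EMPTY


def first(s):
    if empty(s):
        raise ValueError("first of the empty stream is undefined")
    return s[0]


def rest(s):
    if empty(s):
        raise ValueError("rest of the empty stream is undefined")
    return s[1]


def generate(i, n):
    """Stream of the integers i, i+1, ..., n."""
    return EMPTY if i > n else cons(i, generate(i + 1, n))


def delete(x, s):
    """Remove every multiple of x from stream s."""
    if empty(s):
        return EMPTY
    y, s2 = first(s), rest(s)
    s3 = delete(x, s2)
    return s3 if y % x == 0 else cons(y, s3)


def sieve(s):
    """Keep the first element, drop its multiples, and recurse."""
    if empty(s):
        return EMPTY
    x, s2 = first(s), rest(s)
    return cons(x, sieve(delete(x, s2)))


def to_list(s):
    """Helper (ours): collect a finite stream into a Python list."""
    out = []
    while not empty(s):
        out.append(first(s))
        s = rest(s)
    return out
```

For example, `to_list(sieve(generate(2, 30)))` collects the primes up to 30. The recursion depth grows with n, so the sketch is suitable only for small inputs.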


Data flow schemas 


A data flow schema is an operational model of 
concurrent computation. The form of schemas used here 
derives from the work of Dennis and Fosseen [9] and Dennis 
[7]. A data flow schema is a directed graph composed of 
nodes called actors and arcs connecting them. An arc 
pointing to an actor is called an input arc of the actor; and 
an output arc is an arc emanating from the actor. Each 
actor has an ordered set of input arcs and output arcs. 
There are five types of actors: link, operator, switch, merge 
and sink. The five types of actors are shown in Fig. 5. An 
(m, n) data flow schema must have m links which do not 
have input arcs, and n links not having output arcs. These 
links are respectively called input links and output links of 
the (m, n) schema. Further, we require that the schema 
must be proper in the sense that all other actors must have 
the required arcs of their actor type, and each arc must be 
connected at both ends. 


Figure 5. Data flow actors: (a) link, (b) operator, (c) switch (data and control inputs), (d) merge, (e) sink. 


Figure 6. Examples of firing rules: (a) link, (b) operator, and a switch with control value u = true or u = false. 


Stating the operational semantics of data flow 
schemas requires additional concepts. A configuration of a 
data flow schema is the graph of the schema together with 
an assignment of labeled tokens to some arcs of the graph. 
An assignment of a token to an arc is represented by the 
presence of a solid circle on the arc. The label denotes the 
value carried by the token and may be omitted when the 
particular value is irrelevant to the discussion. Informally, 
the presence of a token on an arc means that a value is 
made available to the actor to which the arc points. For the 
present, tokens carry values of type integer, real, boolean, 
structure, or stream. 


Firing Rules 


Execution of an (m, n) schema advances it from one 
configuration to another through the firing of some actor 
that is enabled. The firing rules for the principal actor types 
are specified in Fig. 6. A necessary condition for any actor 
to be enabled is that each output arc does not hold a token. 
An actor is enabled when a token is present on each input 
arc -- with the exception of a merge actor. The firing of an 
actor causes the tokens to be absorbed from the input arcs 
and completes by placing a token on each of the output 
arcs. The values of the output tokens are functionally 
related to the values of the input tokens. A link simply 
replicates the value received and distributes it to the 
destination actors indicated by output arcs. The effect of 
firing an operator is to apply to the inputs v_1, ..., v_m the 
function associated with the operation name written inside 
the operator to yield the outputs u_1, ..., u_n. The switch and 
merge are used for controlling the flow of tokens. A switch 
requires a data input and a control input which is a boolean 
value. The firing of a switch replicates the input token on 
one of the output arcs according to the boolean control 
value. The arrival of a token on either input arc enables a 
merge, and upon firing, a token conveying the same value is 
placed on the output arc. The behavior of a merge is 
inherently nondeterminate: when two input tokens reside on 
the input arcs, the firing rule does not specify in which 
order the output tokens will be generated. A sink absorbs 
the input tokens upon firing and places a special token 
signal on the output arc. The purpose of a sink actor is to 
absorb unwanted values; the signal output token is 
necessary for the implementation of schema application to 
be described. 


The set of functions commonly associated with an 
operator includes the scalar arithmetic operations and 
constant functions. 
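
The firing rules can be made concrete with a small simulation. The Python sketch below is our own illustration and makes several assumptions: a configuration is a dictionary from arc names to token values (an absent key means the arc holds no token), each function attempts a single firing and reports whether it took place, and the merge resolves the case of two waiting tokens by taking its first input, which is only one of the orders the firing rule permits.

```python
def enabled(arcs, inputs, outputs):
    """A token on every input arc and no token on any output arc."""
    return all(a in arcs for a in inputs) and not any(a in arcs for a in outputs)


def fire_operator(arcs, fn, inputs, outputs):
    """Apply fn to the input values; fn returns one value per output arc."""
    if not enabled(arcs, inputs, outputs):
        return False
    args = [arcs.pop(a) for a in inputs]        # absorb the input tokens
    for arc, value in zip(outputs, fn(*args)):  # place the output tokens
        arcs[arc] = value
    return True


def fire_link(arcs, source, destinations):
    """Replicate the received value on every output arc."""
    return fire_operator(arcs, lambda v: (v,) * len(destinations),
                         [source], destinations)


def fire_switch(arcs, data, control, out_true, out_false):
    """Route the data token to one output according to the control value."""
    if not enabled(arcs, [data, control], [out_true, out_false]):
        return False
    v, u = arcs.pop(data), arcs.pop(control)
    arcs[out_true if u else out_false] = v
    return True


def fire_merge(arcs, in_a, in_b, out):
    """Pass on a token from either input; needs only one input token."""
    if out in arcs:
        return False
    for a in (in_a, in_b):          # assumed tie-break: prefer in_a
        if a in arcs:
            arcs[out] = arcs.pop(a)
            return True
    return False
```

Starting from `arcs = {"d": 7, "c": True}`, the call `fire_switch(arcs, "d", "c", "t", "f")` leaves a single token of value 7 on arc "t".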


Well Formed Data Flow Schemas 


Unrestricted use of actors in data flow schemas is 
undesirable since an arbitrary interconnection of these 
actors may form a schema which deadlocks or has 
nondeterminate behavior. Because these properties are 
undesirable for reliable programming we choose a subclass 
of schemas which will satisfy the needs of programming. 


An (m, n) well formed data flow schema is an (m, n) 
data flow schema formed by any acyclic composition of 
component data flow schemas, where each component is 
either a link, a sink, an operator, or a conditional subschema. 


Figure 7. A conditional schema. 


Fig. 7 is an example of a conditional schema which computes 
the value of the expression 


    if a > b then a + b else b - 3 


Here, the trig output provides a completion signal indicating 
that the sink actor has absorbed the unused copy of a. The 
structure of a conditional schema corresponds in an obvious 
way to conditional expressions. 


The Apply Actor 


The class of well formed data flow schemas cannot 
express program features such as procedures, procedure 
applications, and iterations. We introduce an actor apply 
whose meaning is explained in Fig. 8. The first input to an 
apply actor is a token associated with an (m, n) well formed 
data flow schema. An apply actor is enabled when a token 
is present on each input arc. The effect of firing an apply 
actor is to replace the actor with the specified (m, n) 
schema as shown in the figure. The (m, n) schema replacing 
the apply actor may itself contain apply actors, allowing 
recursion to be expressed. 


We have not included structures of data flow schemas 
which correspond to language constructs such as while 
loops in Algol 60 or Do statements in Fortran. Such 
structures necessarily involve cyclic connections of actors 
which do not correspond to actual data dependencies, and 
introduce unnecessary delays. Furthermore, the semantics 
of cyclic schemas is more complicated, since issues of 
safety and liveness must be dealt with. We choose to 
support these language features in the equivalent form of 
recursive application of data flow schemas. This allows 
simultaneous execution of instances of a data flow schema 
which correspond to successive iterations of a while loop. 


An example of the use of apply actors is given in Fig. 
9. This recursive schema implements the "reverse" function 
stated earlier in Fig. 3. The input link actor labeled trig is 
an input link whose function is to trigger those actors that 
generate constants, in this case the create actor that 
produces the empty data structure. 


The apply actor presented requires that all input 
values be present on the input arcs to become enabled. A 
language implemented in terms of the apply actor will have 
"call by value" semantics, that is, the result of application is 
well defined only when the computations producing 
arguments to the procedure all terminate. This is in contrast 
with a more general form of procedure application which 
allows procedure application to begin even though 
computation of some arguments is not complete. 


Figure 8. The apply actor. 


Figure 9. Recursive schema. 


Data Flow Processor 


The structure of a data flow processor suitable for 
supporting execution of recursive data flow schemas is 
shown in Fig. 10. It consists of six subsystems: Functional 
Units, Structure Controller, Execution Controller, the 
Arbitration and Distribution Networks, and the Packet 
Memory. The Execution Controller fetches instructions and 
operands from the Packet Memory and forms them into 
operation packets. Each operation packet is passed to the 
Arbitration Network for transmission to an appropriate 
Functional Unit if a scalar operation is called for, or to the 
Structure Controller for the data structure operations 
create, append, and select. Instruction execution in the 
Structure Controller and Functional Units generates result 
packets which are sent through the Distribution Network to 
the Execution Controller where they will join with other 
operands to activate their target instructions. How this is 
done is explained in greater detail in the next section. 


The Packet Memory holds the collection of data 
structures as a collection of items each being a one-level 
data structure having scalar values and unique identifiers of 
other items as its components [8]. This collection of items 
represents an acyclic directed graph where each arc 
corresponds to a unique identifier component of the item 
representing its origin node. The Packet Memory maintains 
a reference count for each item and reclaims physical 
storage space as items become inaccessible. 


Data structures held in the Packet Memory have three 
roles in the execution of data flow schemas: (1) as 
operands for the data structure operations implemented by 
the Structure Controller; (2) as procedure structures that 
have as components the instructions of a data flow 
procedure; and (3) activation records which hold operand 
values for instructions waiting for their enabling condition to 
be satisfied. 


Although the Execution Controller, Structure Controller 
and the Packet Memory are shown in Fig. 10 as single units, 
we imagine that each is in fact a collection of many identical 
units. For example, the Packet Memory subsystem would 
consist of separate systems, each holding all items whose 
unique identifiers belong to a well defined part of the 
address space of unique identifiers. The Execution 
Controller subsystem would consist of identical modules 
each of which would serve a distinct subset of procedure 
activations. 


The concept of a Packet Memory System was 
introduced in [8], and the design issues for these systems 
and the Structure Controller have been studied in [1, 2]. 


Figure 10. Data flow processor. 


Implementation of Data Flow Schemas 


Procedure Structures 


A data flow schema is represented in the machine by a 
kind of data structure called a procedure structure 
illustrated in Fig. 11a. A procedure structure corresponding 
to a data flow schema of n actors is a data structure having 
n components with integer selector names from 1 to n 
assigned to the actors. Each component, called an 
instruction, is an encoding of an actor and its output arcs. 
The components of an instruction include an operation 
field which defines the function performed by the actor, and 
destination fields D1, ..., Dp corresponding to p output arcs. 
Each destination field has three subcomponents: the inst 
component is the integer selector name of the destination 
instruction; the arc component is an integer designation of 
an input arc of the destination; and the count component is 
the number of operand values required by the destination 
instruction. 


Activation Records 


Since multiple instances of the same schema may be 
concurrently active in a computation, each activation (an 
instance of procedure execution) is represented by a 
separate activation record as shown in Fig. 11b. Each actor 
in an activation is uniquely identified by the tuple (A, i), 
where A is a uid allocated for the activation record and i is 
the integer assigned to the actor in the procedure 
structure. A token of value v on the k-th input arc of an 
actor (A, i) corresponds to a result packet that carries the 
information (A, i, k, v, count), where "count" is the number of 
tokens (operands) required for the enabling of the actor. 


Figure 11. Procedure and activation structures. 


Enabling of an actor is detected by checking the 
number of result packets having arrived at the operand 
record -- the i component of the activation record A -- 
against the count in the result packet. The detection of 
enabling is a function of the Execution Controller and the 
Packet Memory that store activation records. Upon enabling 
of actor instance (A, i), the instruction of the actor is 
fetched from the i component of the procedure structure. 
The following section describes how activation records 
might be manipulated. 


An activation record has components with integer 
selectors for operand records and an additional "text" 
component that is the procedure structure for the 
activation. (In our implementation, this component is shared 
by other activations of the same schema.) An operand 
record may have as many integer subcomponents as input 
arcs of an actor, and also contains an "arrived" component 
indicating the number of arrived result 
packets. Since an activation record stores values of 
arrived result packets in its components, operations on an 
activation record modify its contents. These operations 
are defined as follows: 


(1) create-activation( P ) 
    This returns the uid of a new activation record 
    having P as its "text" component, but no other 
    components. 

(2) insert( A, i, k, v ) 
    The insert operation adds the value v as the k-th 
    operand of the i-th instruction in activation record 
    A. In addition, the "arr" component of the operand 
    record is incremented by one. To handle the first 
    operand value to arrive, a missing "arr" component 
    is interpreted as having the value zero. 

(3) remove( A, i ) 
    This operation releases the i component of A; and is 
    performed by the Execution Controller once it has 
    generated the operation packet for actor instance 
    (A, i). 

(4) free( A ) 
    This operation releases the entire activation record 
    A by means of a command packet sent to the 
    Packet Memory. 

For each arriving result packet ( A, i, k, count, v ) the 
Execution Controller performs the operation insert( A, i, k, v ) 
and tests the updated value of the "arr" component 
against the "count" field of the result packet. If the values 
are equal, the instruction is fetched from the Packet 
Memory and used, together with the operand record, to 
construct an operation packet which is delivered to the 
Arbitration Network. The i component of activation record A 
is then released. 
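
The handling of result packets can be summarized by the following Python sketch. It is a toy model under our own assumptions, not a description of the hardware: the class PacketMemory, the dictionary encoding of instructions with "op" and "dest" fields, and the layout of the operation packet returned by `receive` are all hypothetical. Only the operations create-activation, insert, remove and free, the "arr" counter, and the comparison against "count" are taken from the text.

```python
import itertools


class PacketMemory:
    """Toy stand-in for the Packet Memory: maps uids to activation records."""

    def __init__(self):
        self._uids = itertools.count(1)
        self.records = {}

    def create_activation(self, P):
        """New activation record with only a "text" component; returns its uid."""
        A = next(self._uids)
        self.records[A] = {"text": P}
        return A

    def insert(self, A, i, k, v):
        """Store v as the k-th operand of instruction i and bump "arr"."""
        operand = self.records[A].setdefault(i, {})
        operand[k] = v
        operand["arr"] = operand.get("arr", 0) + 1  # missing "arr" counts as 0
        return operand["arr"]

    def remove(self, A, i):
        """Release the i component (operand record) of A and return it."""
        return self.records[A].pop(i)

    def free(self, A):
        """Release the entire activation record A."""
        del self.records[A]


def receive(mem, packet):
    """Process one result packet (A, i, k, count, v).

    Returns an operation packet (assumed layout) once instruction i is
    enabled, and None while operands are still missing.
    """
    A, i, k, count, v = packet
    if mem.insert(A, i, k, v) != count:
        return None
    instruction = mem.records[A]["text"][i]   # fetch from procedure structure
    operand = mem.remove(A, i)
    arcs = sorted(key for key in operand if key != "arr")
    values = [operand[key] for key in arcs]
    return (A, instruction["op"], values, instruction["dest"])
```

In this model a two-operand instruction yields None for its first result packet and an operation packet for its second, whatever the order in which the two operands arrive.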


Procedure Activation 


Our implementation of the apply actor is illustrated in 
Fig. 12. The apply actor is replaced by the code 
diagrammed in Fig. 12b, and the applied graph F is 
augmented as in Fig. 12c. Here we use the graphical 
notation shown in the figure to mean 
insert( A, i, 1, v ). The new actors extr-uid, 
const-ret and distribute will be explained below. 


This implementation assumes the actors in each 
recursive schema are numbered according to this rule: 


(1) Input link actors are numbered 1, ..., m. 

(2) The link actors that receive the n-tuple of values 
    resulting from a schema application are numbered 
    J + 1, ..., J + n for some integer J. 

(3) A link actor numbered 0 receives a packet ( A, J, n ) 
    containing the information needed to construct 
    result packets for returning values resulting from 
    procedure execution. 

(4) The remaining actors may be numbered arbitrarily. 


Figure 12. Implementation of apply (panel (b): calling graph). 


The implementation scheme works as follows: The 
create-act actor produces the uid A' of a new activation 
record containing "text" component F' and passes it to the 
insert actors associated with input values v_1, ..., v_m. These 
actors cause result packets of the form ( A', i, 1, 1, v_i ) to 
be generated which initiate execution of the new activation 
of F'. At the same time, the extr-uid and const-ref actors 
form the return value ( A, J, n ) and send it to link 0 of 
schema F'. Once result values y_1, ..., y_n have been 
produced, the distribute and insert actors of F' generate 
result packets of the form ( A, J + i, 1, 1, y_i ) which deliver 
result values to the calling schema. The free actor then 
releases the activation record, and its uid A' is returned to 
the pool of free uid's managed by the Packet Memory. 
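
The packet traffic of one procedure application can be written out as follows. This Python sketch only enumerates the result packets named above, in the field order ( uid, instruction, arc, count, value ); the function names are ours, and the packet that carries ( A, J, n ) to link 0 is shown in the same five-field layout as an assumption, since the text does not give its exact form.

```python
def call_packets(new_uid, inputs):
    """Packets (A', i, 1, 1, v_i) that deliver the m inputs to links 1..m of F'."""
    return [(new_uid, i, 1, 1, v) for i, v in enumerate(inputs, start=1)]


def return_info_packet(new_uid, caller_uid, J, n):
    """Packet handing the return value (A, J, n) to link 0 of F' (assumed layout)."""
    return (new_uid, 0, 1, 1, (caller_uid, J, n))


def return_packets(return_info, results):
    """Packets (A, J + i, 1, 1, y_i) that deliver the n results to the caller."""
    A, J, n = return_info
    if len(results) != n:
        raise ValueError("the number of results must match n")
    return [(A, J + i, 1, 1, y) for i, y in enumerate(results, start=1)]
```

Each packet names a single-operand link (arc 1, count 1), so every packet enables its target as soon as it arrives.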


Implementation of Stream Actors 


In the implementation streams are represented as data 
structures. A stream is a data structure having an "f" 
component which is the first element of the stream, and an 
"r" component which is the data structure representing the 
rest of the stream. The empty stream is represented by nil. 
Operations on streams become operations on structure 
values; thus first( s ) and rest( s ) are implemented by 
select( s,"f" ) and select( s,"r" ), respectively. 


We wish to make it possible for a stream to be 
processed by consuming modules while further stream 
elements are generated concurrently. To provide for this 
behavior, we must augment our concept of data structures 
so a data structure may be accessed before it is entirely 
constructed. We use the concept of holes which is based 
on the work of Henderson [11] who used the term "token". 
Our idea is related to but different from the idea of 
"suspensions" discussed by Friedman and Wise [10]. 


The idea is embodied in the implementation of the cons 
operation described in Fig. 13. Here the create-hole and 
write-hole actors are special data structure operators 
defined as follows: 


A create-hole actor returns a uid H allocated from the 
data structure address space. The free node is called 
a hole in that it has two states: filled and unfilled. In 
the unfilled state, all data structure operations on the 
hole are queued except the write-hole operation. 
Upon completion of the write-hole(H,v) operation, the 
hole H changes its state to filled and contains the 
value v. All previously queued and subsequent 
operations on H are processed without further delay; a 
subsequent write-hole operation on H is illegal. 
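
The two-state behavior of a hole can be modeled in a few lines of Python. The sketch is a single-threaded illustration under our own assumptions: the class name Hole and the method `access` are hypothetical, and a queued data structure operation is represented simply as a function to be applied to the value once it is written.

```python
class Hole:
    """A node that may be accessed before its value has been written."""

    def __init__(self):
        self.filled = False
        self.value = None
        self._queue = []  # operations that arrived while the hole was unfilled

    def access(self, operation):
        """Run operation(value) at once if filled; otherwise queue it."""
        if self.filled:
            operation(self.value)
        else:
            self._queue.append(operation)

    def write(self, v):
        """write-hole: fill the hole, then process every queued operation."""
        if self.filled:
            raise RuntimeError("write-hole on a filled hole is illegal")
        self.value, self.filled = v, True
        pending, self._queue = self._queue, []
        for operation in pending:
            operation(v)
```

A consumer that calls `access` before the producer has called `write` is simply deferred, which is how a stream's "r" component can be handed to a consumer before the rest of the stream exists.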


Figure 13. Implementation of cons. 


To illustrate the concurrency provided by this 
implementation of streams, consider the recursive schema 
shown in Fig. 14 for the "sieve" procedure of the prime 
number generator. Note that the output of the top 
activation of "sieve" will be a data structure containing 
the first element of the result stream and a hole waiting to 
be filled in with the data structure generated by the 
recursive activation of "sieve". In this implementation each 
higher activation of "sieve" may be released as soon as it 
has completed its work (i.e., its hole has been filled), 
leaving the remaining work to be finished by deeper 
activations of the code. 


Figure 14. Data flow schema for "sieve". 


Remarks 


The concept of stream has appeared in many forms
[5, 12, 14, 15]. One of the earliest papers that discussed
streams as a programming feature was an unpublished paper
by McIlroy [15]. Despite the conceptual elegance of
streams, programming has not yet departed from the
sequential notion of coroutines and process synchronization
primitives.


Recent interest in concurrent programming
languages and processors has motivated several other
authors to investigate the feasibility of implementing
streams and related concepts of data structures with holes
or with suspensions [4, 10, 13].


TRANSLATION AND OPTIMIZATION OF DATA FLOW PROGRAMS (a)


J. Dean Brock 
Lynn B. Montz 
Laboratory for Computer Science 
Massachusetts Institute of Technology 
Cambridge, Massachusetts 02139 


Abstract -- We present ADFL, an Applicative Data Flow 
Language with an iterative control abstraction based on tail 
recursion and an error-handling scheme appropriate to the 
concurrency of data flow. An algorithm for translating ADFL 
programs into data flow graphs is described. These graphs 
may be executed without possibility of deadlock, but with 
potential loss of some concurrency, on packet
communication systems with bounded buffering, such as the 
Dennis-Misunas data flow computer. Two techniques for 
optimizing graphs are given and their effect on performance 
and correctness is analyzed. One is the insertion of identity 
operators (buffers) into graphs to increase pipelining. The 
other is the elimination of unneeded acknowledge signals. 


Introduction 


In a data flow computer, an operation is performed as 
soon as its operands have been computed. The machine 
language is an explicit representation of the data
dependencies of program operations. Its programs are 
directed data flow graphs whose nodes are called 
operators. The role of operators in a data flow machine is 
similar to the role of instructions in a von Neumann machine. 
The execution of an instruction corresponds to the firing of 
an operator. Each operator has several input and output 
ports. Whenever an operator fires, it absorbs tokens 
(values) at its input ports and produces tokens at its output 
ports. Operators have firing rules which determine when 
they are enabled for firing. These firing rules are based on
the presence or absence of tokens on the operator's ports. 


When operators are joined to form data flow graphs, 
the links of the graph are directed from operator output 
ports to operator input ports. A link transports the results 
produced at an operator output port to an operator input 
port. Thus, links form the pathways upon which data flows 
as tokens are absorbed and produced by the firing of 
operators during the execution of a graph. 


(a) This research was supported in part by the Lawrence
Livermore Laboratory of the University of California under
contract 8545403, in part by the National Science
Foundation under research grant MCS75-04060 A01, and in
part by the Advanced Research Projects Agency of the
Department of Defense under Office of Naval Research
contract N00014-75-C-0661. Part of the research was
conducted while Mr. Brock was supported by a National
Science Foundation graduate fellowship.


The data flow graph of an elementary expression
resembles its parse tree. The graph for computing the
distance function:


    sqrt((x1-x2)^2 + (y1-y2)^2)


is illustrated in Figure 1. The solid black dot in the figure
represents the copy operator which is used to distribute
the results of one output port to several input ports. Note
how this graph represents the operation dependencies and
independencies of the distance function.
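The firing discipline can be made concrete with a small Python sketch of the distance graph. The encoding is an assumption made for illustration: operator names such as dx and dx_sq are hypothetical, each square is modeled as a multiplication fed twice from the same source (the role of the copy operator), and all enabled operators are fired together in each step.

```python
import math

# operator name -> (function, names of the sources feeding its input ports)
GRAPH = {
    "dx": (lambda a, b: a - b, ("x1", "x2")),
    "dy": (lambda a, b: a - b, ("y1", "y2")),
    "dx_sq": (lambda a, b: a * b, ("dx", "dx")),  # copy operator feeds both ports
    "dy_sq": (lambda a, b: a * b, ("dy", "dy")),
    "sum": (lambda a, b: a + b, ("dx_sq", "dy_sq")),
    "dist": (math.sqrt, ("sum",)),
}


def run(graph, inputs):
    """Fire operators as soon as all of their operands are available."""
    tokens = dict(inputs)  # values produced so far, keyed by source name
    pending = set(graph)
    trace = []             # operators fired at each step
    while pending:
        enabled = sorted(
            op for op in pending if all(src in tokens for src in graph[op][1])
        )
        if not enabled:
            raise ValueError("no operator is enabled")
        for op in enabled:
            fn, sources = graph[op]
            tokens[op] = fn(*(tokens[src] for src in sources))
        pending -= set(enabled)
        trace.append(enabled)
    return tokens, trace


tokens, trace = run(GRAPH, {"x1": 3, "x2": 0, "y1": 4, "y2": 0})
```

The trace shows the independence of the two subtractions and of the two squarings: each pair fires in the same step, while the addition and the square root must wait for their operands.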


Preliminary data flow machine designs have been made 
by Arvind and Gostelow [2], Davis [5], and Dennis and 
Misunas [7]. Within these machines, a data flow graph is 
distributed over a network of processing elements. These 
elements operate concurrently, constrained only by the 
operational dependencies of the graph. Thus, a very 
efficient utilization of the machine's resources appears 
possible. 


Figure 1. sqrt((x1-x2)^2 + (y1-y2)^2)


ADFL - An Applicative Data Flow Language


Data flow programming languages resemble
conventional languages restricted to those features whose
ease of translation does not depend on the state of a
computation being a single, sequentially manipulated entity.
Because the "state" of a data flow graph is distributed for
concurrency, goto's, expressions with side effects, and
multiple assignments to the same variable are difficult to
represent. ADFL, Applicative Data Flow Language, is a
simplification of VAL, the value-oriented data flow language
being developed by Ackerman and Dennis [1]. A BNF
specification of the syntax of ADFL follows:


exp ::= id | const | exp , exp | oper(exp) |
        let idlist = exp in exp end |
        if exp then exp else exp end |
        for idlist = exp do iterbody end


iterbody ::= exp | iter exp |
        let idlist = exp in iterbody end |
        if exp then iterbody else iterbody end


id ::= "programming language identifiers"
idlist ::= id { , id }
const ::= "programming language constants"
oper ::= "programming language operators"

The most elementary expressions of ADFL are
identifiers and constants. Tuples of expressions are also
expressions: one such expression is "x, 5". The application
of an operator to an expression is an expression. Although
the BNF specification only provides for operator applications
in prefix form, such as "+(x, 5)", applications in infix form,
such as "x + 5", are considered acceptable equivalents
(sugarings) and will be used in example ADFL programs. In
sequential programming languages execution exceptions are
generally handled by program interrupts (signals). This
solution is inappropriate for data flow since there is no
control flow to interrupt. Applied to "exceptional" inputs,
data flow operators yield special error values, such as
zero_divide or pos_over. The documentation of VAL [1]
contains a detailed specification of this method of
error-handling. For simplicity, only one error value undef is
used throughout this paper.


Since ADFL is applicative, it provides for the binding, 
rather than the assignment, of identifiers. Evaluation of the 
binding expression: 


    let y, z = x + 5, 6 in y * z end


implies the evaluation of "y * z" with y equal to "x + 5" and
z equal to 6. The result of binding is local: the values of y
and z outside the binding expression are unchanged.


ADFL contains a conventional conditional expression, 
but has an unusual iteration expression. Evaluation of the 
iteration expression: 


for idlist = exp do iterbody end 


is accomplished by first binding the iteration identifiers, the
elements of idlist, to the values of exp. Note from the BNF
specification of iterbody that the evaluation of the iteration
body will ultimately result in either an expression or the
"application" of a special operator iter to an expression.
This application of iter is actually a tail recursive call of the
iteration body with the iteration identifiers bound to the
"arguments" of iter. The iteration is terminated when the
evaluation of the iteration body results in an ordinary,
non-iter expression. The value of this expression is returned
as the value of the iteration expression. The following
iteration expression computes the factorial of n:


    for i, y = 1, 1 do
        if i <= n then iter i + 1, y * i else y end
    end
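The tail recursive reading of iter can be sketched in Python as a loop that rebinds the iteration identifiers until the body yields an ordinary value. The helper run_iteration and the tagged return values are assumptions made for illustration; the loop test is written here as i <= n, which is the reading under which the result is n! rather than (n-1)!.

```python
def run_iteration(initial, body):
    """Evaluate `for idlist = initial do body end`.

    `body` returns ("iter", new_values) to re-iterate with the iteration
    identifiers rebound, or ("result", value) to terminate.
    """
    values = initial
    while True:
        tag, values = body(*values)
        if tag != "iter":
            return values


def factorial(n):
    def body(i, y):
        # if i <= n then iter i + 1, y * i else y end
        if i <= n:
            return "iter", (i + 1, y * i)
        return "result", y

    return run_iteration((1, 1), body)
```

Both new values are computed from the old bindings of i and y, matching the simultaneous binding of the tuple passed to iter.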


Syntactic restrictions which ensure that expressions 
are used only when appropriate in arity and type have been 
omitted from this discussion. Elsewhere in this volume, 
Dennis and Weng [8] define arity and type restrictions for a 
data flow language similar to VAL and, consequently, ADFL. 
Their language differs from ours in emphasis: They present 
an abstract interpreter with a dynamic allocation scheme for 
executing graphs and, accordingly, emphasize procedural 
control abstractions. We investigate the execution of 
statically allocated graphs (data flow machine language 
programs) and, accordingly, emphasize iterative control 
abstractions. 


Translation of ADFL 


The translation algorithm of ADFL consists of two
functions: T, mapping ADFL expressions into their data flow
graph implementations, and I, mapping ADFL iteration
bodies into their implementations. The graph implementing
an expression or iteration body has an input port for each
free variable of the expression or iteration body. For an
expression exp which returns n values when evaluated,
T[[exp]] has n output ports. Recall that evaluation of an
iteration body will yield either results to be re-iterated or
results to be returned by the containing iteration
expression. The graph I[[iterbody]] has an output port
iter? which signals which possibility has occurred and sets
of output ports for each possibility: I output ports for values
to be iterated and R output ports for values to be returned.


The translation algorithm for ADFL resembles previous
translation schemes of Dennis [6] and Weng [11]. A
detailed recursive definition of the algorithm over the
eleven cases of the BNF specification of the syntax of ADFL
has been given by Brock [3]. For brevity, only the cases of
the conditional expression, the conditional iteration body,
and the iteration expression will be examined in detail. It is
assumed that most readers, informed that the graph of
Figure 1 may be re-labeled:


    T[[let dx,dy = x1-x2,y1-y2 in sqrt(dx*dx+dy*dy) end]]


will discover the translation of the eight "trivial" cases.


The graph T[[if exp1 then exp2 else exp3 end]] is
shown in Figure 2. The graph contains three subgraphs,
T[[exp1]], T[[exp2]], and T[[exp3]], and several gates. The
T gate has a control input port (entering its left side), a
data input port, and an output port. When the T gate fires,
it absorbs a token from each input port. If the control token
is true, the data token is passed to the output port. If the
control token is false, the data token is simply absorbed.
No output token is produced. The F gate is defined
analogously. It passes its data token only if its control
token is false. By passing inputs to T[[exp2]], respectively
T[[exp3]], through T gates, respectively F gates, controlled
by the output of T[[exp1]], the proper subexpression is
"enabled" during data flow evaluation of the conditional
expression. The results of T[[exp2]] and T[[exp3]] are
merged by M gates. The M gate has one control input port,
two data input ports, and one output port. Its control token
selects the data token to be passed. If the control value is
the error value undef, each T or F gate absorbs a data
token and produces no output tokens, and each M gate
produces undef and absorbs no input tokens. Thus, data
flow evaluation of a conditional expression yields a tuple of
undef's if the condition is undef.


Figure 2. T[[if exp1 then exp2 else exp3 end]]
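The gate behavior described above, including the treatment of undef, can be sketched in Python. This is an illustrative model only: tokens on an arc are modeled as list elements, UNDEF is an assumed encoding of the error value, and conditional is a hypothetical helper that wires one T gate, one F gate, and one M gate around two single-input subgraphs.

```python
UNDEF = "undef"  # assumed encoding of the error value


def t_gate(control, data):
    """Absorb both tokens; pass the data token only if control is true."""
    return [data] if control is True else []


def f_gate(control, data):
    """Absorb both tokens; pass the data token only if control is false."""
    return [data] if control is False else []


def m_gate(control, true_arc, false_arc):
    """Pass the data token selected by control; on undef, produce undef
    and absorb no data token."""
    if control is True:
        return true_arc.pop(0)
    if control is False:
        return false_arc.pop(0)
    return UNDEF


def conditional(control, then_fn, else_fn, x):
    """Evaluate `if control then then_fn(x) else else_fn(x) end`."""
    then_out = [then_fn(v) for v in t_gate(control, x)]
    else_out = [else_fn(v) for v in f_gate(control, x)]
    return m_gate(control, then_out, else_out)
```

With an undef control token neither branch receives input, so no branch output exists for the M gate to absorb, and the result is undef.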


I[[if exp then iterbody1 else iterbody2 end]], the
conditional iteration body graph illustrated in Figure 3, is
similar to the conditional expression graph. With T and F
gates, the output of the expression subgraph, T[[exp]],
enables one of the iteration body subgraphs, I[[iterbody1]]
and I[[iterbody2]]. The selected subgraph will produce
output at either its I or R output ports, according to its iter?
output: true, for I outputs to be iterated; false, for R
outputs to be returned. Using the output of the expression
subgraph and the iter? outputs of the iteration body
subgraphs, the IC gate calculates three iteration control
outputs: the graph iter? output and the control tokens for
the M gates producing the graph I and R outputs. The table
at the bottom of Figure 3 gives the firing rules of the
IC gate. Note that, if the output of the expression subgraph
is undef, the conditional iteration body graph will produce
false at its iter? port, thus announcing termination of
iteration, and will produce undef at its R output ports.
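The IC gate's firing rules amount to a small decision table. The Python sketch below is one way to encode it, under stated assumptions: None stands for "no token" (an input not absorbed or an output not produced), UNDEF is an assumed encoding of the error value, and the function and parameter names are hypothetical.

```python
UNDEF = "undef"  # assumed encoding of the error value


def ic_gate(exp, iter_then=None, iter_else=None):
    """Return (iter?, I control, R control) for one firing of the IC gate.

    exp       -- output of the expression subgraph
    iter_then -- iter? output of the then iteration body (used if exp is true)
    iter_else -- iter? output of the else iteration body (used if exp is false)
    """
    if exp is True:
        # Then-branch selected: route its I outputs or its R outputs.
        return (True, True, None) if iter_then else (False, None, True)
    if exp is False:
        # Else-branch selected.
        return (True, False, None) if iter_else else (False, None, False)
    # Undefined condition: terminate and let the R merge gates emit undef.
    return (False, None, UNDEF)
```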


Figure 3. I[[if exp then iterbody1 else iterbody2 end]]


IC gate firing rules

        inputs                     outputs
exp     iter?1   iter?2     iter?    I       R
true    true     -          true     true    -
true    false    -          false    -       true
false   -        true       true     false   -
false   -        false      false    -       false
undef   -        -          false    -       undef


The graph T[[for idlist = exp do iterbody end]] is
shown in Figure 4. This cyclic graph is formed by using
M gates to merge the outputs of T[[exp]] and the I outputs
of I[[iterbody]] and by routing the merged outputs into the
input ports of I[[iterbody]] labeled by identifiers of idlist.
The control input port of each M gate is connected to the
iter? output of I[[iterbody]]. The connecting arc contains
an initial false token to ensure that the first data value is
selected from T[[exp]]. Thereafter, data tokens are
selected according to iter?. A true iter? token, signalling
continued iteration, selects the data tokens of the I output
ports. A false iter? token, signalling termination,
re-initializes the M gates for subsequent iteration
expression evaluations. Identifiers which are free in
iterbody but are not contained in idlist are routed through
S gates. For false control tokens, the S gate absorbs,
stores, and outputs its data tokens. For true control
tokens, it produces its stored value and absorbs no data
tokens. Thus, the S gate stores new values when
evaluation of the iteration expression begins and produces
them at each subsequent iteration step. Like the M gate, it
is initialized with a false control value.


Figure 4. T[[for idlist = exp do iterbody end]]


Brock [4] has verified this translation algorithm by
proving it to be consistent with a denotational [10]
specification of ADFL. In the proof, data flow arcs are
assumed to be implemented by infinite (unbounded) queues.
The transformations described subsequently will relax this
requirement without affecting the correctness of
translation.


Transformations of Data Flow Graphs


In proposed data flow machines of the
Dennis-Misunas [7] design, operations are held in instruction
cells which contain a register for each input arc. These
registers are effectively an implementation of data flow
arcs as queues of capacity one. The implication of the
bounded arcs is that operators must be prevented from
producing new tokens until their output arcs are empty.
This behavior is ensured by modifying the firing rules so that
no operator is enabled if a token is present on any of its
output arcs.


By performing a transformation, illustrated in Figure 5,
which replaces each arc of the graph by an appropriate
data/acknowledge arc pair (d/a arc pair), the effect of the
modified firing rule can be explicitly built into the graph:
the presence of a token on an acknowledge arc indicates
that the corresponding data arc is empty. As a consequence,
operator firing rules revert to the original format of
depending only on the presence of tokens on input
(including acknowledge) arcs, where the previous enabling
requirement that output arcs be empty has been replaced
with the requirement that acknowledge inputs be present.
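A d/a arc pair can be modeled as a one-place buffer plus a flag. The Python sketch below is a minimal illustration; the class and method names are hypothetical, and it assumes that an acknowledge token is initially present on every pair, since the data arc starts out empty.

```python
class DAArcPair:
    """A data/acknowledge arc pair of capacity one."""

    def __init__(self):
        self.data = []   # data arc: holds at most one token
        self.ack = True  # acknowledge token present <=> data arc is empty

    def producer_enabled(self):
        """The producer needs the acknowledge token as an extra input."""
        return self.ack

    def consumer_enabled(self):
        return bool(self.data)

    def produce(self, value):
        if not self.ack:
            raise RuntimeError("producer fired without an acknowledge")
        self.ack = False          # absorb the acknowledge token
        self.data.append(value)   # place the data token

    def consume(self):
        value = self.data.pop()   # absorb the data token
        self.ack = True           # return an acknowledge to the producer
        return value
```

The producer's firing rule now tests only for the presence of tokens: it is enabled again exactly when the consumer has fired.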


Figure 5. Replacement of one-place buffers


Montz [9] and Dennis and Misunas [7] have shown
that graphs of data flow programs may be executed without
deadlock when arcs are implemented as data/acknowledge
pairs. Consequently, the correctness of the translation
algorithm is not affected by this transformation. However,
the implementation is not without cost. Aside from the
obvious overhead involved in incorporating acknowledge
arcs and tokens, the constraints which they impose on the
token flow through the graphs may cause bottlenecks. In
response to these issues, Montz [9] has developed
optimization techniques specifically aimed at either
increasing the throughput by balancing the token flow or
decreasing the overhead by removal of unnecessary
acknowledge arcs.


Balancing Token Flow 


The goal of the optimization to balance token flow
through the graph is to increase throughput by modifying the
graph to display maximum pipelining. The bottleneck
problem, and therefore application of the optimization, arises
in acyclic segments of a data flow graph. A clear illustration
of the problem and solution is shown in the Figure 6 graph,
the implementation of the ADFL expression T[[if f=1 then
f1 else f2 end]]. Although successive sets of inputs should
be processed simultaneously, the control structure of the
graph dictates that the overlap be minimal. In order for
a second set of values to enter the branches of the
conditional, both α and β (Figure 6) must fire a second time,
presenting the sets of T and F gates with new control
inputs. However, α cannot fire a second time until the
M gate to which it also sends a control input has fired to
produce an acknowledge. Thus the d/a arc pair connecting
α and the M gate (shown with slashes in Figure 6) creates a
bottleneck whose severity depends on the depth of the
computations performed within the branches of the
conditional.


Eliminating this behavior so that successive sets of
values may pipeline through the graph can be accomplished
by inserting identity operators (buffers) along the slashed
arc, breaking it into d/a arc segments which consequently
allow α to fire several times before forcing the M gate to
fire. For the Figure 6 graph, this is accomplished by
replacing the slashed arc with the arc segment shown to its
immediate left. To generalize this optimization technique, a
determination of the ideal number and location of inserted
buffers must be made. This requires an analysis of data
flow graph execution.


Figure 6. Insertion of buffers for a conditional expression


Though the data flow computer is asynchronous, it can 
be made to model a synchronous machine by assuming that 
during any given unit of time all enabled operators must fire 
and produce a result. This approximates optimal program 
execution by preventing an enabled operator from remaining 
enabled and thereby slowing up processing for any length of 
time. 


Referring to Figure 6, we note that each input set to
the graph will result in the production of a token on the
control (slashed) arc and tokens that will be processed by
either f1 or f2. While under the "synchronous machine"
assumption the tokens being processed by the functional
operators can move one step through the graph during
every time unit, the control token on the slashed arc cannot,
restricting throughput to an output every fifth time unit.
Adding identity operators to equalize buffer capacities
achieves maximum pipelining, or equivalently, the optimal
throughput of an output every second time unit. The
algorithm presented below equalizes buffering.


Algorithm to Maximize Pipelining 


Starting from each graph input, descend through the 
graph assigning consecutive numbers to the arcs 
joining successive sets of operators until a 
multi-input operator is encountered. Compare the 
arc numbers on the input arcs of the operator and: 


(a) if equal, continue the arc numbering process 

(b) if not equal, balance the arcs by inserting 
identity operators into the lower numbered 
arcs. Renumber the modified arcs and 
continue the arc numbering process. 
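The numbering and balancing steps can be sketched in Python for an acyclic graph. This is a simplified illustration under stated assumptions rather than the algorithm of Montz [9]: the graph is given as a dictionary listing each operator's sources in topological order, arcs leaving graph inputs are numbered 1, and the function only counts the identity operators needed on each arc instead of rewriting the graph. The operator names in the example are hypothetical.

```python
def balance(operators, graph_inputs):
    """Count identity operators needed to equalize arc numbers.

    operators    -- dict: operator -> list of sources (graph inputs or
                    earlier operators), listed in topological order
    graph_inputs -- names of the graph inputs
    Returns (buffers, arc) where buffers[(src, op)] is the number of
    identities inserted on the arc from src to op, and arc[name] is the
    number assigned to the arcs leaving `name`.
    """
    arc = {name: 1 for name in graph_inputs}
    buffers = {}
    for op, sources in operators.items():
        goal = max(arc[src] for src in sources)  # highest numbered input arc
        for src in sources:
            if arc[src] < goal:
                # Each identity raises the arc number by one.
                buffers[(src, op)] = goal - arc[src]
        arc[op] = goal + 1  # output arcs take the next consecutive number
    return buffers, arc


# Toy graph: x * (y + 1), followed by a three-input operator that
# compares a control arc and two data arcs in the manner of an M gate.
toy = {
    "add": ["y"],
    "mul": ["x", "add"],
    "merge": ["ctl", "mul", "x"],
}
buffers, arc = balance(toy, ["x", "y", "ctl"])
```

In the toy graph one identity is needed on the x input of the multiplication, and the three-input operator takes its highest numbered input arc as the goal for the other two.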


Note that if the operator is an M gate, the comparison and 
balancing process described above must involve all three 
input arcs, using the highest numbered arc as the goal. 
Figure 7 shows the result of applying this algorithm to the 
graph translation of the following program segment: 


if f=1 then if s=1 then x*(y+1) else x*(y-1) end 
else x*y end 


For reference purposes, the added identities have been
numbered. Identities I1 and I2 have been added in
response to the imbalances which occur when comparing arc
numbers on the input arcs to the multiplication operators.
I3 through I5 are added in response to the comparison of
input arcs to the inner M gate. Note that as specified in the
algorithm, arc number comparisons involve all three M gate
input arcs. Finally, operators I6 through I15 are introduced
as a result of comparing input arcs to the outer M gate.


Figure 7. Example of maximal pipelining


In applying the algorithm to this example, there are 
several interesting observations to make. Recall from the 
algorithm, that M gate comparisons must involve the two 
data arcs and the control arc. The algorithm modifies the 
graph to achieve maximum pipelining by making buffering 
capacities of the paths through the graph to the control arc 
and two data arcs the same. However, while each branch 
of the conditional operates in conjunction with the control 
arc, the branches themselves are independent. Thus, while 
each branch must pipeline with the control path, they need 
not necessarily pipeline with each other. If the two
conditional paths are of different lengths, the pipelining 
choices available are to equalize the control path with 
either the shorter or the longer conditional branch, or to 
equalize all three. The latter of these, implemented by the 
algorithm above, achieves best throughput, but has the 
disadvantage of causing the insertion of additional identity 
operators in the shorter conditional branch. The other two 
choices recognize the independence of the two conditional 
paths and avoid excess buffering, but possibly at the cost 
of reduced throughput.


A factor not yet considered which interacts with this 
pipelining choice is the frequency with which graph paths 
are taken. In Figure 7 each input set can take any of three 
paths corresponding to the three possible states of f and s. 
If, for example, the pattern of input sets is such that no one
of the three paths is taken twice in a row, identity
operators I1 and I2 would be unnecessary and could be
removed without decreasing the throughput. Illustrations of
this point can be found in Montz [9].


The discussion of trade-offs and options to consider in
maximally pipelining data flow graphs indicates that the
advantage of smaller size resulting from a less than
maximally pipelined graph may be worth a decrease in
throughput. Some key issues influencing the choice might
include cost of identity operations, processor utilization,
token flow patterns, and width and depth of program. By
modifying the pipelining algorithm, we can produce data flow
graphs which display limited pipelining, meaning that the
delay between an operator's firing and receiving
appropriate acknowledge signals may be several time units.
For example, it is possible to specify that the delay in
sending acknowledge signals be no greater than two time
units. The change to the algorithm, which involves balancing
arcs to within a specified bound, allows a graph to be easily
reconfigured to display different degrees of pipelining, and
thereby provides a feasible and practical control method of
studying varying levels of pipelining in a graph. Though the
details of the modified algorithm will not be given, we
proceed by briefly comparing the Figure 7 graph with that of
Figure 8, which can be produced using a limited pipelining
algorithm.


The most striking contrast between the fully pipelined
graph and this partially pipelined version is the large
reduction in inserted identity operators, from 15 to 7. The
question which arises is whether the cost of this reduction
is a decrease in performance, where the Figure 7 graph
displays the optimum performance by producing an output
every second time unit. An analysis of several token flow
patterns using different successions of input sets shows
that the limited pipelining scheme does not necessarily
degrade the throughput. This can be seen by pipelining
three sets of inputs through the Figure 8 graph assuming
that they respectively follow the paths indicated by the f-s
values: true-true, false, and true-false.


Figure 8. Example of limited pipelining


Once an actual data flow machine is available, a study 
of the number of inserted identity operators vs. throughput 
trade-off should provide insight into the direction to take 
concerning optimization. This information in combination with 
a particular application should indicate other optimization 
possibilities; for instance, concentrating on only the main 
source of bottleneck within a graph. For the conditional 
construct this point appears to be the control arc to the 
M gate. Modifications of the pipelining algorithm could also 
be weighed more realistically as alternative approaches. 


A final point to note in the consideration of this 
pipelining optimization strategy is that conditional constructs 
and general compositions of operators turn out to be fairly 
representative of the type of graphs for which this 
optimization is applicable. In fact, this optimization approach 
is basically inappropriate for an iterative process whose 
function is to modify and recycle a single set of inputs at a
time (although subgraphs within an iteration may be 
pipelined). Thus an alternative optimization which aims to 
minimize the number of acknowledges in a graph by 
eliminating those which are unnecessary has been 
developed. 


Eliminating Acknowledges 


This optimization technique aims at decreasing 
overhead by removing acknowledge arcs which are not 
necessary to maintaining safe operation. This safety 
requirement is equivalent to guaranteeing that at most one 
token will reside on any arc of a data flow graph at any 
time. An examination of various ADFL constructs leads to 
the identification of arc pairs which are candidates for 
acknowledge arc removal. The strategy will be to develop a 
rule specifying the requirements for acknowledge arc 
removal for each candidate arc pair identified in the 
construct. By recursively applying the resulting set of rules 
to the data flow graph translation of an ADFL program, 
acknowledge arc removal for all candidate arc pairs can be 
determined. 


To illustrate the analysis and formulate the desired
rules, we begin by considering the data flow graph
translation of the general conditional construct shown in
Figure 6. As in the preceding section, the discussion
centers on the arc pair connecting α and the M gate.
However, while overcoming the restricting behavior of this
arc pair was the focus of the optimization aimed at
increasing pipelining, the restriction is an advantage to the
process of eliminating acknowledges. Specifically, α, which
cannot fire a second time until it receives an acknowledge
from the M gate, guarantees that a second input set will not
be within the branches of the conditional until processing of
the preceding set has completed. Each input set (which will
be processed by either f1 or f2) places a token on the
controlling arc of the M gate and a data token on each of
the arcs labeled either a and b, or c and d, depending
respectively on whether the control token was true or false.
Assuming that f1 and f2 are well-formed, an output should
appear on arc g (assuming the control token was true)
within finite time, with the impossibility of a second token
appearing on arc g, or of any token appearing on arc h, until
the M gate has fired. This firing simultaneously processes
the token on arc g and sends an acknowledge token to α,
after which a successive input set may enter a
branch of the conditional. This behavior guarantees that the
acknowledge arc of the arc pair denoted by g can be safely
removed. By an analogous argument we can remove the
acknowledge arc of the arc pair labeled h.


Using similar reasoning one might be tempted to 
remove the acknowledge arcs from arc pairs a, b, c, and d 
under the assumption that once a set of tokens has entered 
a branch of the conditional, the tokens must be used by the 
appropriate function to produce the corresponding output. 
However, a consideration of the Figure 9 data flow graph 
will show that removal of acknowledge arcs for these arc 
pairs is dependent on the subgraphs represented by f1 and
f2.


The Figure 9 data flow graph is a translation of the
following ADFL program segment:


    if f=1 then if s=1 then x*(y+1) else x end
           else x*y end


Assume that the outer decision operator evaluates to true
and that of the inner conditional construct, previously
represented by f1, evaluates to false. The important point
to note is that an output can be produced using only the
tokens on arcs a and b. The token on arc c need not
propagate through the graph, and may in fact still be on the
arc when a successive set of values arrives. Removal of
c's acknowledge arc would make it possible to reach the
unsafe token configuration shown in Figure 9. This example
shows that the necessity of acknowledge arcs for arc pairs
a through e is dependent on whether or not their values are
guaranteed to be used in producing the outputs of their
appropriate subgraph (f1 or f2). An analysis of the
subgraphs in Figure 9 reveals that tokens arriving on arcs a,
b, d, and e must be used to produce their corresponding
output, while the need of a token arriving on arc c is
dependent on the outcome of the inner decision operator.
Therefore, we must leave c's acknowledge arc, but can
remove those of arc pairs a, b, d, and e.


Figure 9. Unsafe token configuration resulting from
removal of c's acknowledge arc


This analysis, specific to the conditional construct, 
results in designating all input arc pairs to the f1 or f2
subgraphs subject to rule C1 with regard to acknowledge 
arc removal: 


C1: The acknowledge arc of an input arc pair to a 
subgraph may be removed if any token arriving on 
the arc must be used in producing the output of the 
subgraph. 


This form of analysis must be recursively applied to 
subgraphs in determining acknowledge arc removal for both 
inner constructs and outer arc pairs. It is interesting to 
note that this rule could be applied at the source level by 
taking the intersection of variables appearing in the then 
and else clauses. Variables found in the intersection would 
be guaranteed to be used in producing the output, and in 
graph form would not require acknowledge arcs. 
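The source-level form of rule C1 can be sketched in Python as a set intersection. This is a rough illustration only: the helper names are hypothetical, identifiers are extracted with a simple regular expression, and the sketch handles only flat clauses (a nested conditional would first have to be reduced by the same rule, since its keywords would otherwise be picked up as identifiers).

```python
import re


def variables(clause):
    """Identifiers appearing in a flat ADFL clause (no nested keywords)."""
    return set(re.findall(r"[A-Za-z_]\w*", clause))


def always_used(then_clause, else_clause):
    """Variables used whichever branch is taken: candidates under rule C1."""
    return variables(then_clause) & variables(else_clause)


# Inner conditional of the Figure 9 segment: if s=1 then x*(y+1) else x end
inner = always_used("x*(y+1)", "x")
```

For the inner conditional only x lies in the intersection; y appears in the then clause alone, which is consistent with the need to keep an acknowledge arc for an input whose use depends on the inner decision.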


Referring to Figure 6, the arc pairs presenting inputs
to the T and F gates have not yet been discussed with
regard to acknowledge arc removal. Since the only way to
guarantee the absence of a token on any of these data
arcs is via the presence of a token on the corresponding
acknowledge arc, these acknowledge arcs must remain. A
final point concerns the initially discussed control arc
connecting α with the M gate, which may not need an
acknowledge arc. The control arc of the inner conditional
construct of Figure 9 is an example of such an occurrence,
which can be characterized by rule C2:


C2: The acknowledge arc of the control arc 
connecting a and the M gate of a conditional 
construct can be removed if the acknowledge arc of 
the output arc pair of the M gate has been 
removed. 


Developing a complete recursive algorithm to determine 
acknowledge arc removal in data flow graphs requires this 
type of analysis for each ADFL construct. 


Figure 10. T[[for idlist = exp do iterbody end]] 


As a second example, we briefly examine the iteration 
construct shown in Figure 10 to identify candidate arc pairs 
for acknowledge arc removal. The arc labeled V_iter?, the 
control output of the iteration, provides the controlling value 
for the sequence of M gates handling the presentation of 
successive sets of inputs to the iteration body. Since the 
V_iter? value is dependent on at least some of the M gate 
inputs, a number of them must fire before a second V_iter? 
value is produced. This necessarily implies the firing of the 
copy operator, "L", to present the M gates with new control 
inputs needed to re-enable them, ensuring that the V_iter? 
output arc from the iteration body to L must be empty for a 
successive V_iter? value to be produced. Consequently, the 
V_iter? arc needs no acknowledge. No such guarantee can be 
made for the arcs between the copy operator and M gates, 
acknowledges for which can be conditionally removed 
subject to rule T1: 


T1: The acknowledge arc for an arc pair between 
the copy operator and the sequence of M gates can be 
removed if its data value must be used in producing 
the V_iter? value. 


The output arc of the iteration body labeled I 
represents the arc pairs for the iteration variables. The 
analysis for these arcs is more complex and is governed by 
the following rule: 


T2: The acknowledge arc of an I (iteration) arc pair 
can be removed if either 
(1) The iteration body cannot emit a value on 
that output arc until it has absorbed the 
corresponding input value on the 
corresponding input arc. 
(2) The V_iter? value depends on the 
corresponding input arc. 


Examples involving the iteration construct, as well as an 
expanded discussion of these rules and an analysis of the 
remaining arcs can be found in Montz [9]. 


Conclusions 


We have described a data flow language, an algorithm 
for translating its programs into data flow graphs, and two 
techniques for optimizing these graphs for execution with 
data flow machines of the Dennis-Misunas [7] design. While 
the two optimization methods have been presented as 
isolated techniques, they must be integrated into a single 
procedure for application to a given program. 


We have not compared the costs of operation of the 
Dennis-Misunas [7] computer design with that of the 
Arvind-Gostelow [2] design, which avoids conflicts through 
the use of tagged values rather than acknowledge tokens. 


Abstract -- This paper describes an analysis 
of the major sources of overhead in multiprocessor 
systems with emphasis on performance equations for 
large systems. A model is developed for studying 
the relative contributions of these sources of 
overhead. The traditionally treated problem of 
memory contention is shown to be containable with- 
in bounds with limit equations provided. Software 
control table lockout on the other hand is shown 
to be beyond containment in large systems such 
that an upper limit on performance exists. Effec- 
tive methods of reducing lockout overhead are 
explored. Control program efficiency is shown to 
be the only means of achieving very large multi- 
processor systems which are efficient. It is 
shown also that if such efficiency could be ob- 
tained in a centralized control mechanism (by hard- 
ware or other means), there are no other immediate 
theoretical problems associated with increasing 
multiprocessor size. 


Introduction 


There are known limitations to single proc- 
essor approaches to increasing general purpose 
computer throughput capabilities [8]; moreover, 
requirements for increased throughput seem more 
general and insatiable than ever. The advent of 
inexpensive microprocessors has emphasized the 
necessity for an effective multiprocessing tech- 
nology capable of effectively combining many proc- 
essors to obtain significant throughput. The cost 
advantages of multi-microprocessors over high 
speed main frame processors provide a natural 
motivation for re-evaluating the problems pre- 
viously encountered in large MIMD multiprocessing 
systems. It is therefore the limiting performance 
behavior where many processors are involved that 
is the central theme of this paper. 


The theoretical problems associated with dead- 
lock avoidance and synchronizing concurrent proc- 
esses have been solved. [3],[9],[13] The practi- 
cal problems however, which are encountered when 
implementing large multiprocessing systems have 
seemed unavoidable. To address these practical 
issues, a general parameterized model of the major 
overhead contributions in multiprocessing systems 
is presented. Descriptions of the individual over- 
head contributions modeled separately are found in 
the literature, but not integrated mathematical 
models as presented here. Nor has the emphasis of 
these other models been on performance expectations 
in the limit as system size increases. The model 
described in this paper relates the three major 
contributions to overhead in multiprocessing sys- 
tems to the desired application program processing 
requirements in order to assess potential perfor- 
mance capabilities. A diagrammatic illustration of 
the modeled sources of overhead is provided in 
Figure 1. These are the following: 


1. System Control. The multiprocessor exe- 
cutive control program execution time requirements. 


2. Control Table Lockout. To provide co- 
ordinated control, common queues are required 
which imply critical sections in the control pro- 
gram which accesses these queues. 


3. Memory Contention. Common physical mem- 
ory for multiple processors creates the possi- 
bility of multiple processors converging on the 
same physical memory module, in which case a 
processor may have to wait until other processors' 
access requests have been serviced. 


FIGURE 1: MODELED SOURCES OF OVERHEAD 


To obtain a comprehensive model of multiproc- 
essing overhead, without inappropriate complexity, 
a hierarchical model has been developed. The 
levels and states in this hierarchy are the obvious 
ones. Figure 2 is a state diagram of the time 
expenditure states at the top level in this model 
of a multiprocessor system. These states are: 

P, the normal processor operations associated with 
instruction sequencing and performing the instruc- 
tions in its repertoire, and C, the memory delays 
which may include sequences awaiting memory con- 
tention resolution. In order for this model to 
be valid, both the spatial and temporal distribu- 
tions of memory access requests must be constant 
and independent of the changing occupations of 
the processor. These assumptions are character- 
istic of current multiprocessing. (One of the 
design trades considered further on investigates 
potential advantages resulting from changing the 
temporal distribution.) 


FIGURE 2: TIME EXPENDITURE STATES 


Time overhead (throughput) is the multiproc- 
essing concern here. Other aspects of multiproc- 
essing including memory and peripheral sizing have 
been modeled in reference [6]. These other aspects 
are very important in a system, and should be op- 
timized to obtain the best performance for any 
given configuration. But they are not the major 
obstacles to a viable multiprocessing capability. 


Processor Time Expenditure Model 


The time utilization characteristics of the 
various activities that can be assigned to the 
processor are modeled here. In a multiprocessor 
system, it is expected that for some of these 
activities the amount of time expended may be de- 
pendent upon the number of processors, N. (This 
definition of N will be assumed throughout the 
rest of this paper.) The P state of the processor 
shown in Figure 2 can be modeled in more detail as 
shown in the state diagram of Figure 3. The 
states in this diagram are the following: 


1. Idle state, awaiting an eligible appli- 
cation program task, 


2. Application task execution, 


3. Control program execution, and 


4. Control table lockout. 


In order to get performance predictions in- 
dependent of the software configuration, it has 
been assumed that the idle state will be null. 
We are only interested here in performance degra- 
dation not attributable to insufficient jobs to 
go around. (Utilization considerations will be 
discussed later on however.) It is also assumed 
that lockout will only be experienced as a part 
of the control program execution, and is therefore 
called control table lockout. Critical sections 
in the application program are assumed to be re- 
solved by task eligibility considerations handled 
by the control program. To resolve such conflicts 
in the application programs is not the direction 
of high performance multiprocessing, since ex- 
cluding the parallel execution of such programs 
improves throughput. The timeline in Figure 4 
shows the phasing among the remaining three states. 


FIGURE 4: TASK TIMELINE BASIS FOR PROCESSOR 


Each of the three remaining processor time expen- 
ditures is modeled very simply in the following. 
An equilibrium situation is assumed among the 
states, so that the numbers of processors entering 
and leaving each state are approximately equal. 
The level of sophistication could obviously be 
increased appreciably in these models, but it has 
been found that performance predictions are re- 
latively insensitive to such improvements. The 
simpler models are easier to describe and under- 
stand, and fit existing multiprocessor performance 
data very adequately. 


Application Program Task Execution 


The model of application task execution in- 
volves a constant execution time requirement, A, 
for all tasks with a single queue/dispatch/exit 
control program request overhead. The model is 
still valid for programs making multiple requests 
so long as the ratio of application to control 
program execution time, $\rho = A/\phi$, is a constant. 
This ratio is used extensively later on in the 
analytical derivation of performance; it is called 
the individual processor efficiency. It is af- 
fected only by the control program overhead per 
application task, defined so as to exclude the 
effects of lockout induced by multiple processors. 


Control Program Execution 


The execution time of the control program is 
assumed to be broken into J partitions. These 
partitions are assumed to be mutually exclusive 
critical sections with equal execution frequency 
as well as execution time, $\phi_j$: 

$$\phi = \sum_{j=1}^{J} \phi_j = J \phi_j$$ 


The control program is assumed to require the 
same constant total amount of execution time, $\phi$, 
for each task. It is also assumed that its exec- 
ution time is independent of the number of proc- 
essors in the system. The latter of these assump- 
tions supposes that queues are implemented with 
multiple pointers such that the lengths of queues 
do not result in a commensurable amount of search- 
ing to process linked task lists. This seems to 
be a unilateral approach to sophisticated control 
programs appropriate to multiprocessing. 


Control Table Lockout 


Coordination of the activities of many proc- 
essors to achieve a single computational objective 
requires the control program to have common task 
queues for exploiting the parallel aspects of in- 
dividual application programs. It is assumed 
that control table lockout occurs at entry to each 
of the J control program partitions, each of which 
is comprised of a mutually exclusive critical 
section. The total amount of lost time due to 
this control table lockout will be: 

$$L = \sum_{j=1}^{J} L_j$$ 

where $L_j$ is the amount of lockout 
attributed to the jth critical section. 


In order to derive an expression from which a 
value can be computed for the overhead L, we will 
define $N_j$ as the number of processors waiting for 
and/or executing the jth critical section in the con- 
trol program. From this definition it can be seen 
that the amount of lockout time a processor will 
experience before entering the jth critical sec- 
tion will be $L_j = N_j \phi_j = N_j \phi / J$. $N_j$ can be deter- 
mined as the probability $P_j$ of an individual proc- 
essor being in this jth state, times the number of 
possible competing processors, N-1 in this case. 
The probability $P_j$ can be determined as the pro- 
portion of time spent in the jth state to the 
total amount of time spent by each processor: 

$$P_j = \frac{\phi_j + L_j}{A + \phi + L} = \frac{1 + N_j}{J(\rho + 1 + N_j)}$$ 

Thus, since $N_j = P_j (N-1)$, we obtain a second 
order equation for $N_j$: 

$$N_j = \frac{(1 + N_j)(N-1)}{J(\rho + 1 + N_j)}$$ 

The formal solution to this equation is: 

$$N_j = \frac{(N-1) - J(\rho+1) + \sqrt{[(N-1) - J(\rho+1)]^2 + 4J(N-1)}}{2J}$$ 

For J=1, we obtain: 

$$N_1 = \frac{(N - \rho - 2) + \sqrt{(N-\rho)^2 + 4\rho}}{2}$$ 

The expected number of locked out processors, 
$N_L$, is plotted in Figure 5 for various values of $\rho$. 

These curves are in agreement with Madnick [15] 
in spite of a very different derivation. The 
significance of increasing the effective parti- 
tions in the control program will be discussed 
further on. 
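The lockout model can be checked numerically. The sketch below assumes the fixed-point relation $N_j = (1+N_j)(N-1) / (J(\rho+1+N_j))$ that follows from the definitions in this section, and solves the resulting quadratic for its positive root. The function names are illustrative only.

```python
import math


def locked_out_per_section(n, rho, j=1):
    """Positive root of J*x^2 + (J*(rho+1) - (N-1))*x - (N-1) = 0,
    where x plays the role of N_j in the lockout model."""
    b = j * (rho + 1.0) - (n - 1.0)
    c = -(n - 1.0)
    return (-b + math.sqrt(b * b - 4.0 * j * c)) / (2.0 * j)


def fixed_point_residual(x, n, rho, j=1):
    """x - (1+x)(N-1) / (J*(rho+1+x)); zero when x satisfies the model."""
    return x - (1.0 + x) * (n - 1.0) / (j * (rho + 1.0 + x))


if __name__ == "__main__":
    for n in (2, 4, 8, 16):
        print(n, round(locked_out_per_section(n, rho=2.0), 3))
```

With a single processor the root is zero (no competitors), and for fixed N and $\rho$ a larger J gives a smaller $N_j$, which is the effect of partitioning the control program that the text returns to later.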


Combining Processor Time Expenditures 


The objective of the processor activity mo- 
deling has been to obtain insight into the rela- 
tive amount of time spent by each processor in 
its A, $\phi$ and L states. Equivalently we are inter- 
ested in knowing the total number of processors 
in the configuration occupied by each activity. 
This assessment can be obtained by establishing 
the ratios of time spent in each activity to the 
unit of a processor's time. By defining $X_A$, $X_\phi$, 
and $X_L$ as the respective ratios for the A, $\phi$, and 
L activity states, it can be seen that: 

$$X_A = \frac{A}{A+\phi+L}, \quad X_\phi = \frac{\phi}{A+\phi+L}, \quad X_L = \frac{L}{A+\phi+L}$$ 

Furthermore, the equivalent number of proc- 
essors involved in each activity per unit time, 
$N_A$, $N_\phi$ and $N_L$, can be determined as: 

$$N_A = X_A \cdot N, \quad N_\phi = X_\phi \cdot N, \quad N_L = X_L \cdot N$$ 

In order to establish these relative contri- 
butions, we can substitute in the results obtained 
previously. The unit of processor time for J=1 
is thus seen to be: 

$$U = \rho + 1 + L/\phi = \frac{N+\rho}{2} + \sqrt{\left(\frac{N-\rho}{2}\right)^2 + \rho}$$ 

$$N_A = \rho \cdot N \cdot U^{-1}$$ 
$$N_\phi = N \cdot U^{-1}$$ 
$$N_L = (U - \rho - 1) \cdot N \cdot U^{-1}$$ 
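As a worked illustration of these expressions, the following sketch evaluates U and the three equivalent processor counts for J=1. It assumes the closed form $U = (N+\rho)/2 + \sqrt{((N-\rho)/2)^2 + \rho}$; the function names are hypothetical.

```python
import math


def unit_time(n, rho):
    """U = rho + 1 + L/phi for J = 1, expressed in units of phi."""
    return (n + rho) / 2.0 + math.sqrt(((n - rho) / 2.0) ** 2 + rho)


def processor_breakdown(n, rho):
    """Return (N_A, N_phi, N_L): equivalent processors in the application,
    control program, and lockout states. The three always sum to N."""
    u = unit_time(n, rho)
    n_a = rho * n / u
    n_phi = n / u
    n_l = (u - rho - 1.0) * n / u
    return n_a, n_phi, n_l


if __name__ == "__main__":
    for n in (1, 2, 4, 8, 16):
        print(n, [round(v, 3) for v in processor_breakdown(n, rho=2.0)])
```

For a single processor the lockout share is zero and the application share reduces to $\rho/(\rho+1)$, which is a useful sanity check on the reconstruction.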


Memory Contention Delay Model 


There are various memory/processor intercon- 
nection schemes that can be employed for access 
arbitration including multiport controllers and 
crossbar switches as described by Enslow [10] 
which effect the logical interconnecting paths 
shown in Figure 1. Specific configuration depen- 
dencies such as processor clock phasing, memory 
address interleaving, processor to memory speed 
ratios, and processor memory request duty cycle 
are discussed in reference [17]. The mathematical 
modeling of the performance to be expected of con- 
figurations incorporating such dependencies is 
addressed here. 


In a general multiprocessor configuration with 
M memories and N processors, we are concerned with 
the percentage of time that the processors spend 
waiting for a memory to service their requests 
[2], [12]. 


General Model of Synchronous Interleaved Memory 


To simplify the model we have assumed equal 
likelihood of a processor accessing any of the 
memories on a given request. Address interleaving 
makes that a realistic assumption. In addition, 
it has been assumed that each processor synchro- 
nously makes a memory access each cycle; this is 
a worst case situation tending to make the result- 
ing performance predictions pessimistic rather 
than optimistically unrealistic. 


We will begin by defining the probability, 
$P(i)$, of exactly i processors converging on single 
memories anywhere in the system on a given access: 

$$P(i) = \sum_{j=1}^{\lfloor N/i \rfloor} P(i,j)$$ 

where $\lfloor x \rfloor$ is the largest 
integer less than or equal to x, and $P(i,j)$ is 
the probability that there are j instances of 
exactly i processors converging on single memories 
in the system. (For a detailed treatment of pro- 
bability theory, refer to Feller [11].) To pro- 
ceed, we will consider the conditional probabili- 
ties $p_p(i,j)$ and $p_m(i,j)$, which are respectively 
the probabilities of a processor and a memory 
being involved in an i-way convergence of proc- 
essors on memories if there are j instances of 
such convergence in the system. Under the random 
accessing and equivalence between processors as- 
sumptions that we have made: 

$p_p(i,j) = i \cdot j / N$, since i×j of the N processors are 
involved. 

$p_m(i,j) = j / M$, since j of the M memories are involved. 

Now the unconditional probabilities of processors 
and memories being involved in i-way convergence 
situations can be determined as: 

$$P_p(i) = \sum_{j=1}^{\lfloor N/i \rfloor} p_p(i,j) \cdot P(i,j) = \frac{i}{N} \sum_{j=1}^{\lfloor N/i \rfloor} j \, P(i,j)$$ 

$$P_m(i) = \sum_{j=1}^{\lfloor N/i \rfloor} p_m(i,j) \cdot P(i,j) = \frac{1}{M} \sum_{j=1}^{\lfloor N/i \rfloor} j \, P(i,j)$$ 

And therefore: $P_p(i) = \frac{i \cdot M}{N} P_m(i)$ 


Modeling Memory Response Time 


So far we have only been dealing with the 
probabilities of processor/memory convergence, 
whereas what we are really interested in is con- 
tention situations where processor time is lost. 
We therefore assume that there is some number, k 
(not necessarily unity, but for convenience a 
positive integer) of processors whose requests can 
be accommodated by each memory module without any 
of the contending processors experiencing delays. 
k is the ratio of processor request time over 
memory response time. A new conditional probabil- 
ity, $P_R(i)$, can therefore be defined which is the 
probability that a processor involved in an i-way 
convergence situation will actually experience 
contention: 

$P_R(i) = (i-k)/i$ for i>k; $P_R(i) = 0$, otherwise. 

Then the probability a processor will experience 
memory contention due to i-way convergence situ- 
ations is: 

$$P_C(i) = P_R(i) \cdot P_p(i)$$ 

$P_C(i) = (i-k) \frac{M}{N} P_m(i)$, for i>k; 
$P_C(i) = 0$, otherwise. 


The total probability of a processor experi- 
encing memory contention, $P_C$, can be computed as: 

$$P_C = \sum_{i=k+1}^{N} P_C(i)$$ 

since contention can only occur 
when i>k. Therefore we have: 

$$P_C = \frac{M}{N} \sum_{i=k+1}^{N} (i-k) \, P_m(i)$$ 


Approximating the Distribution Function 


We are left then with the requirement for ob- 
taining a distribution function $P_m(i)$. Many such 
models of processor queueing on individual memor- 
ies have been advanced [2],[7]. It has been shown 
that little accuracy advantage accrues from se- 
lecting the more sophisticated models involving 
Markov chains. This is particularly applicable 
for the configurations discussed in this article 
where memory contention is shown to be small, 
since we are primarily interested in configurations 
for which M>N and k>1. Bhandarkar [5] has shown 
percentage errors of less than 5 percent in all 
cases for the model assumed here. 


The model that we have selected is the bi- 
nomial approximation of Strecker [16] which was 
found to "work well in all cases" by Baskett and 
Smith [4] and with more accuracy for M>N by 
Bhandarkar [5]. This model is precisely valid 
for the initial allocation of processors to mem- 
ories under the assumptions made previously. 


According to this model, the probability that 
exactly i processors converge on a given memory 
module on a given cycle is: 

$$P_m(i) = \binom{N}{i} \left(\frac{1}{M}\right)^{i} \left(1 - \frac{1}{M}\right)^{N-i}, \quad \binom{N}{i} = \frac{N!}{i!(N-i)!}$$ 

Therefore, according to this model: 

$$P_C = \frac{M}{N} \sum_{i=k+1}^{N} (i-k) \frac{N!}{i!(N-i)!} \left(\frac{1}{M}\right)^{i} \left(1 - \frac{1}{M}\right)^{N-i}$$ 

The form of $P_C$ as a function of the number of 
memory modules is shown for N=20 processors in 
Figure 6. The impact of varying the relative 
speed, k, of memory access and processor request 
logic is illustrated in the figure, applying re- 
spectively for k=1, 2 and 3. 


FIGURE 6: IMPACT OF RATIO OF PROCESSOR REQUEST 
TO MEMORY RESPONSE TIMES 
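The contention probability under the binomial approximation can be evaluated directly. The sketch below assumes $P_m(i)$ is binomial with parameters N and 1/M and that $P_C = (M/N) \sum_{i>k} (i-k) P_m(i)$, as described in this section; the function names are illustrative.

```python
from math import comb


def convergence_pmf(i, m, n):
    """Binomial probability that exactly i of n processors address one
    given memory module out of m on a cycle."""
    return comb(n, i) * (1.0 / m) ** i * (1.0 - 1.0 / m) ** (n - i)


def contention_probability(m, n, k=1):
    """P_C: probability that a processor is delayed, when each module can
    serve k requests per processor cycle without delay."""
    total = sum((i - k) * convergence_pmf(i, m, n) for i in range(k + 1, n + 1))
    return (m / n) * total


if __name__ == "__main__":
    # N = 20 processors, as in the setting described for Figure 6.
    for m in (5, 10, 20, 40):
        print(m, [round(contention_probability(m, 20, k), 4) for k in (1, 2, 3)])
```

For k=1 the sum collapses to $1 - (M/N)(1 - (1 - 1/M)^N)$, which gives an independent check on the implementation.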


It should be noted that all of the conver- 
gence and contention probabilities are functions 
of M, N, and k, specifically $P_C = P_C(M, N, k)$. 
The probability distribution functions $P_p$ and $P_m$ 
are functions of M and N as well as i, e.g., 
$P_m(i) = P_m(i, M, N)$. 


Limiting Behavior of "Square" Systems 


It is interesting to note that memory conten- 
tion decreases very rapidly with M until the num- 
bers of memories and processors are approximately 
equal (M=N), and very slowly thereafter. We will 
refer to systems for which M=N as "square" multi- 
processors, and define the notation: $P_C(M=N,k) 
= P_C(M,M,k) = P_C(N,N,k)$. To understand the signi- 
ficance of configuring multiprocessors with ap- 
proximately equal numbers of memory modules and 
processors, consider the limiting values of 
$P_C(M=N,k)$ as M and N become large. The limits of 
the summation can be changed to obtain: 

$$P_C = \frac{M}{N} \left[ \sum_{i=0}^{N} (i-k) \, P_m(i) - \sum_{i=0}^{k} (i-k) \, P_m(i) \right]$$ 

Then noticing that $\sum_{i=0}^{N} P_m(i,M,N) = 1$, we obtain: 

$$P_C = \frac{M}{N} \left[ \sum_{i=0}^{N} i \, P_m(i) - k - \sum_{i=0}^{k} (i-k) \, P_m(i) \right]$$ 

To obtain a limit for $P_m$ we have substituted 
M=N into $P_m(i,M,N)$ and used the limit 

$$e^{-1} = \lim_{N \to \infty} \left(1 - \frac{1}{N}\right)^{N}$$ 

Then for "square" systems: 

$$\lim_{N \to \infty} P_C(M=N,k) = 1 - k + e^{-1} \sum_{i=0}^{k} \frac{k-i}{i!}$$ 

The limiting values for k=1, 2, and 3 are 
shown in Table I. 
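The limit for "square" systems can be evaluated with a few lines of code. This sketch assumes the limiting form $1 - k + e^{-1} \sum_{i<k} (k-i)/i!$ obtained from the binomial model with M=N, and compares it against the finite-N expression after the change of summation limits. The function names are hypothetical, and the printed numbers are values of this reconstructed formula rather than entries copied from Table I.

```python
from math import comb, exp, factorial


def square_limit(k):
    """Limiting contention probability for M = N as N grows without bound."""
    return 1.0 - k + exp(-1.0) * sum((k - i) / factorial(i) for i in range(k))


def finite_square(n, k):
    """Finite-N contention probability for M = N, written with the
    summation limits changed to run over i = 0 .. k-1 only."""
    def pmf(i):
        return comb(n, i) * (1.0 / n) ** i * (1.0 - 1.0 / n) ** (n - i)
    return 1.0 - k + sum((k - i) * pmf(i) for i in range(k))


if __name__ == "__main__":
    for k in (1, 2, 3):
        print(k, round(finite_square(20, k), 4), round(square_limit(k), 4))
```

For k=1 the limit is simply $e^{-1}$, and each additional unit of relative memory speed reduces the limiting contention substantially.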


TABLE I: LIMITING MEMORY CONTENTION PROBABILITIES 


Asymmetry Ratio (Numbers of Processors to Memories, $\eta$) | Relative Speed (Memory to Processor, k) | Limiting Contention Probability ($\lim_{M,N \to \infty} P_C(M,N)$) 


Incorporating Access Duty Cycle 


In real systems there is typically not exact- 
ly one memory access per processor per request 
cycle, and the processors are not synchronized 
relative to whether they actually access memory 
on a given cycle. There are two typical processor 
characteristics which are responsible. 


1. Processor operations do not typically re- 
quire an access on every cycle of the instruction. 
Statistically, somewhat less than half of the TI 
9900 microprocessor machine cycles require a 
memory access, for example. 


2. Some processors implement a cache memory 
scheme for look-ahead memory accessing to reduce 
the average wait time in the processor. This 
reduces the number of cycles for which the proc- 
essor makes memory accesses, but substantially 
increases the number of accesses outstanding 
when they are made. 


These (in general combined) phenomena estab- 
lish an effective, although statistically varying 
memory access duty cycle. These characteristics 
of real systems cannot be modeled by varying the 
memory to processor speed ratio, k. However, at 
least where large numbers of processors are as- 
sumed, an approximately constant access duty 
cycle, d, can be expected which will alter the 
apparent number of processors actually making mem- 
ory accesses at any particular cycle to an equili- 
brium value for large systems of $N = d \cdot N'$. Real 
"square" systems would then be characterized by 
the model as "rectangular" systems of dimensions 
$N = \eta \cdot M$, where $\eta$ is the apparent asymmetry ratio. 

Limiting Behavior of "Rectangular" Systems 


It is interesting to consider memory con- 
tention effects when system size is increased in 
congruent rectangular form. Just as was the case 
for "square" systems, it can be seen that for 
large "rectangular" systems the contention prob- 
abilities level off to approximately constant 
values. Chang, Kuck and Lawrie [8] derived an 
expression for the limit from the memory's view- 
point (the probability of a memory rather than a 
processor being involved in a contention situa- 
tion). The results do not incorporate the speed 
ratio, k. 


Limiting processor contention in large "rec- 
tangular" systems can be derived using the same 
approach as described previously for "square" 
systems. 

$$\lim_{M \to \infty} P_C(N = \eta \cdot M, k) = 1 - \frac{k}{\eta} + e^{-\eta} \sum_{i=0}^{k} \frac{(k-i) \, \eta^{i-1}}{i!}$$ 


Accuracy considerations relying on Bhan- 
darkar's [5] data suggest $\eta < 1$ as the primary 
domain of usefulness for this equation. The 
limits for k=1 and for asymmetry values $\eta$ = 1, 
1/2 and 1/3 are shown in Table I. The asymp- 
totic approach to these limits is shown in 
Figure 7. 
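The "rectangular" limit can be evaluated in the same way as the "square" one. The sketch assumes the limiting form $1 - k/\eta + e^{-\eta} \sum_{i<k} (k-i)\eta^{i-1}/i!$, with $\eta = N/M$ held fixed as the system grows; the function name is hypothetical and the outputs are values of this reconstructed formula, not figures taken from Table I.

```python
from math import exp, factorial


def rectangular_limit(eta, k=1):
    """Limiting contention probability for N = eta * M as M grows."""
    tail = sum((k - i) * eta ** (i - 1) / factorial(i) for i in range(k))
    return 1.0 - k / eta + exp(-eta) * tail


if __name__ == "__main__":
    for eta in (1.0, 1.0 / 2.0, 1.0 / 3.0):
        print(round(eta, 3), round(rectangular_limit(eta, k=1), 4))
```

Setting $\eta = 1$ recovers the "square" limit, and smaller $\eta$ (more memories per processor) lowers the limiting contention.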


$\eta$ is the apparent asymmetry ratio. 


Combining Processor and Memory Contention Overhead 


In the previous accounting of processor time 
expenditures, there were only three categories 
corresponding to the three processor states of 
application program, control program and control 
table lockout. It must now be acknowledged that 
not all of the time spent in these three states is 
correctly attributed to these causes, since memory 
contention takes a proportional amount of time 
from each. By this assumption, we have: 
$C = (A+\phi+L) \cdot P_C$. Thus, if we define the respective 
primed quantities to represent the time in each 
state exclusive of memory contention, we have: 
$A+\phi+L = A'+\phi'+L'+C$ and therefore: 
$A'+\phi'+L' = (A+\phi+L)(1-P_C)$ and the respective numbers 
of processors in a multiprocessor configuration 
expended in the various states are the following: 

$$N_A' = N_A (1 - P_C)$$ 
$$N_\phi' = N_\phi (1 - P_C)$$ 
$$N_L' = N_L (1 - P_C)$$ 
$$N_C = (N_A + N_\phi + N_L) \cdot P_C = N \cdot P_C$$ 


The form of $N_C$ is independent of $N_A$, $N_\phi$, and 
$N_L$: $N_C$ increases linearly with increasing system 
size for congruent rectangular increases, with 
the slope depending upon the relative speed of the 
memories and processors and the asymmetry ratio. 
This phenomenon is shown in Figure 8. The dashed 
line represents the extrapolation from data pre- 
sented by Bhandarkar [5] which resulted from a 
more accurate Markov chain model for k=1, $\eta$=1. 


FIGURE 8. $N_C$ FOR "SQUARE" SYSTEMS 


The form of the other three expected numbers 
of processors $N_A'$, $N_\phi'$ and $N_L'$ can be obtained by 
substitution from previously obtained solutions 
for $N_A$, $N_\phi$, $N_L$ and $P_C$. It should be clear that 
$N_A'$ provides a desirable measure of throughput in 
multiprocessors. It provides the effective number 
of processors being applied to the application 
programs. 


To understand the importance of individual 
processor efficiency on multiprocessor throughput 
performance, it is interesting to look at the 
form of $N_A'(\rho)$: 

$$N_A' = \frac{\rho \cdot N \cdot (1-P_C)}{\frac{N+\rho}{2} + \sqrt{\left(\frac{N-\rho}{2}\right)^2 + \rho}}$$ 


For large N there is an asymptotic approach 
to a limiting throughput, T, and this limit is: 

$$T = \lim_{N \to \infty} N_A' = \rho \cdot (1-P_C)$$ 


The trailing factor may approach a limit as 
well, since in general $P_C$ is a function of N. 
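The approach to the limiting throughput can be illustrated numerically. The sketch below assumes the J=1 closed form for U and treats $P_C$ as a fixed input rather than a function of N; both function names are hypothetical.

```python
import math


def effective_application_processors(n, rho, p_c=0.0):
    """N_A' = rho * N * (1 - P_C) / U for J = 1, with P_C held constant."""
    u = (n + rho) / 2.0 + math.sqrt(((n - rho) / 2.0) ** 2 + rho)
    return rho * n * (1.0 - p_c) / u


def limiting_throughput(rho, p_c=0.0):
    """T = rho * (1 - P_C), the large-N limit of N_A'."""
    return rho * (1.0 - p_c)


if __name__ == "__main__":
    rho = 2.0
    for n in (1, 2, 4, 8, 16, 64):
        print(n, round(effective_application_processors(n, rho), 4))
    print("limit", limiting_throughput(rho))
```

Under these assumptions the effective number of application processors rises quickly for small N and then saturates just below $\rho$, however many processors are added.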


Thus, the control program efficiency not only 
determines the utilization per processor, but also 
the maximum achievable throughput of the entire 
machine. In Figure 9 (which represents state of 
the art capabilities in large scale multiprocessor 
systems) there is a maximum achievable return 
(even with $P_C = 0$) of two equivalent processors 
applied to application programs. By adding any 
number of processors beyond 4, the most that will 
be gained is 0.35 equivalent processors applied 
to application programs. 


FIGURE 9: 


The previous equation also indicates the im- 
pact of memory contention on maximum performance. 
Memory access efficiency, $P_a$, the probability of 
not experiencing contention on an access, can be 
defined as follows: $P_a = 1 - P_C$. Then maximum 
throughput, T, for the whole system is equal to 
the product of the efficiencies of an individual 
processor, $\rho$, computed with no contention or 
lockout, and $P_a$ of the memory accesses: 

$$T = \rho \cdot P_a$$ 


Design Trades in Multiprocessors 


It is significant that in the example shown 
in Figure 9, memory contention is not responsible 
for the reduced efficiency of processors as a 
function of their increased number. This is not 
to say that memory contention cannot be a very 
significant overhead factor, but rather that it 
is a problem which has been solved by the existing 
multiprocessing technology. In the example, mem- 
ory contention is reduced to insignificance by the 
large number of memory modules (M>>N). Another 
method which solves the memory contention problem, 
which is particularly appropriate in microprocessor 
systems, is increasing the relative speed of the 
memories. These solutions are appropriate re- 
spectively to large mainframe configurations 
requiring a large memory base to perform their 
normal operations, and to microprocessor-based 
systems for which it is not a stringent require- 
ment to obtain relatively fast memories. 


Reducing Memory Contention 


There are of course many configurations for 
which memory contention appears to be very signi- 
ficant. In the solid lines in Figure 10, the 
situation previously presented in Figure 9 has 
been modified to include only 5 rather than 50 
memory modules. In this example, there is actu- 
ally a negative improvement in application program 
throughput for more than 4 processors. The reason 
for this negative return can be seen to be attri- 
butable to the increasing number of locked out 
processors. These processors are assumed to 
access semaphores in main memory and thereby con- 
tribute heavily to memory contention and are not 
productive even when successful. This phenomenon 
can be eliminated by assuming the semaphores are 
stored in a special purpose memory dedicated to 
semaphore control. In this case the ratio of the 
numbers of processors in the three processor 
states independent of contention are the same. 
The effective number of processors competing for 
memory is reduced, however, to $N_A + N_\phi$. By esti- 
mating $P_C$ for 5 memories and $N_A + N_\phi$ processors, we 
obtain the revised overhead plots shown as dotted 
lines. The marginal gain in performance for few 
processors can be seen. Memory contention has 
been effectively reduced, but the advantage has 
largely been taken up by increased lockout and 
system control overhead. This example illustrates 
the very important point that memory contention 
can be reduced to insignificance without a commen- 
surable return in throughput. See also Flores [12] 
for a similar conclusion. Memory contention is 
not the peril of multiprocessing. 


FIGURE 10: REDUCING MEMORY CONTENTION 


Reducing Processor Lockout 


It is clear from the preceding discussion 
that lockout is the primary contributor to multi- 
processor inefficiency for large numbers of proc- 
essors. Let us therefore consider various means 
by which it can be reduced. The starting point 
of course is the consideration of the assumptions 
that went into the model of control table lockout. 
The primary assumption was that the control tables 
are locked out throughout the execution of the 
control program. Thus, the approaches in attempt- 
ing to resolve the control table lockout problems 
are: 


1. Design a control program employing a more 
limited use of lockout, 


2. Reduce the execution time of the critical 
sections in the control program, and 


3. Partition the control program into many 
separate rather than a single common critical 
section. 


The relative effectiveness of the various 
methods of reducing lockout overhead ultimately 
depends upon the design of the control program 
itself. There are upper limits for each of these 
methods. The amount of processing power released 
to application programs as the result of improve- 
ments in these areas will be discussed below. For 
few processors (small N) the advantage of reducing 
the length of critical sections or increasing the 
number of partitions is negligible, whereas an 
improvement in control program overhead is an 
immediate advantage even for few processors. For 
large N the improvement in performance has the 
same form for reducing extent of critical sections 
and improving efficiency. 


It should be apparent that these three solu- 
tions have direct analogs in the reduction of mem- 
ory contention which are respectively: Reducing 
accesses to common memory, increasing the relative 
speed of memory response logic, and increasing the 
number of independent memory modules which can be 
accessed. Solutions incorporating the three ap- 
proaches to lockout are illustrated in the follow- 
ing discussion, with the improvements all being 
relative to the system whose performance charac- 
teristics were shown in Figure 9. Line A in 
Figure 11 represents this baseline system's 
throughput performance. 


Limiting Control Program Lockout. It is not actu- 
ally necessary for the entire control program to 
be locked out such that only one processor can be 
executing it at any one time. In Figure 11, line 
B, the expected performance is shown for a system 
whose control program need be locked out only half 
of the time. The data for the figure was obtained 
using a different value of ρ for lockout than for 
determining the proportion of useful work performed. 
ρ_LOCKOUT can be computed as the total amount of 
time the processor spends in non-locked-out proc- 
essing states divided by the amount of processing 
time spent in states for which lockout is required. 
In the current model this can be expressed as: 

\rho_{LOCKOUT} = \frac{A + O(1-Z)}{ZO} = \frac{\rho + 1}{Z} - 1, 

where Z is the proportion of the control program 
requiring lockout. In Figure 11, line B, ρ = 2, 
Z = 0.5, and therefore ρ_LOCKOUT = 5. 
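
As a numerical check on the lockout ratio (ρ + 1)/Z - 1, the following minimal sketch evaluates it for the parameters quoted for Figure 11, line B. It assumes the reading ρ = A/O, where A is the application task time and O is the control program overhead per task; the function names and the symbol O are illustrative assumptions, not notation confirmed by the article.

```python
def lockout_ratio(rho: float, z: float) -> float:
    """Non-locked-out time divided by locked-out time, (rho + 1)/z - 1.

    rho: assumed ratio of application time to control program overhead.
    z:   proportion of the control program that requires lockout.
    """
    if not 0.0 < z <= 1.0:
        raise ValueError("z must lie in (0, 1]")
    return (rho + 1.0) / z - 1.0


def lockout_ratio_from_times(a: float, o: float, z: float) -> float:
    """Same ratio from raw times: (a + o*(1 - z)) / (z*o)."""
    return (a + o * (1.0 - z)) / (z * o)


print(lockout_ratio(2.0, 0.5))                  # 5.0
print(lockout_ratio(2.0, 1.0))                  # 2.0
print(lockout_ratio_from_times(4.0, 2.0, 0.5))  # 5.0
```

With Z = 1 the expression reduces to ρ, the baseline case in which the entire control program is locked out.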


Control Program Efficiency. An obviously effec- 


tive method of improving multiprocessor throughput 
is by directly decreasing the execution time of 
the critical section portions of the control pro- 
gram. Figure 11, line D illustrates the perfor- 
mance to be expected if the efficiency of the con- 
trol program were improved by a factor of 2. In 
this case ρ = 4 instead of ρ = 2. 


Partitioning the Control Program. The lockout 
which is necessary in control programs does not 
necessarily lock out all of the critical sections 
in the program. Earlier, an equation was developed 
for lockout assuming there were J partitions of 
the control program with independent critical 
sections. This equation was used to obtain the 
performance indicated in Figure 11, line C, for 
J=2. In reference [18] it was also suggested 
that a small number (2, 3 or 4) of partitions 
significantly improve efficiency. It should be 
obvious, however, that the limiting number of 
partitions that could be incorporated is not a 
large number. 


Increasing Individual Processor Efficiency. 


The level at which application programs inter- 
face with the control program has the same impact 
on efficiency as does the overhead involved in the 
control program. If the execution time of the 
typical application program task is increased 
such that the number of executable instructions is 
doubled, the same efficiency advantages will accrue 
as if the overhead of the control program were 
reduced to one-half its original value. One must 
be careful in this regard, however, since the uti- 
lization of processors can be significantly re- 
duced. Utilization was ignored in this article 
by assuming that there are no processors in the 
idle state. (See Figure 3.) The job control 
languages of batch processing systems largely 
determine the task level. This is a critical 
issue particularly in mainframe multiprocessing, 
but one which is beyond the scope of the current 
article. 


Kuck [14] investigated the potential for 
breaking up general problems into parallel segments 
to obtain commensurable speedup. The inherent 
parallelism was shown to be roughly proportional 
to the size of the application program, if the 
program units which are dispatched are taken to a 
low enough level. This is in contrast to what was 
formerly thought to have been an order of log 
relationship [14]. Thus, there is potential in 
the programs themselves for solution by parallel 
arrays of slow processors to obtain very high 
throughput. But this level would reduce the 
effective value of A by orders of magnitude which 
in turn reduces ρ (and with it feasibility) by 
orders of magnitude. And thus, methods which 
artificially increase ρ do not attack the multi- 
microprocessor problem. 


The Future of Multiprocessing 


It has been demonstrated that the high lever- 
age design considerations in multiprocessing at 
this time are control table lockout and the con- 
trol program overhead. Hardware support for the 
multiprocessor executive is the obvious place to 
look for help, since the improvement required to 
realize large arrays of processors is orders of 
magnitude rather than simple multiples. 


Let us consider the potential of such solu- 
tions to determine whether there are other theo- 
retical problems. Figure 12, line A illustrates 
the system described originally in Figure 9, but 
assuming an individual processor efficiency of 
ρ = 100. In this configuration memory contention 
becomes appreciable after about 10 or 20 processors, 
and the maximum achievable throughput is seen to 
be about 40 processors. But as shown in Figure 12, 
line B, the asymptotic limit can be more than 
doubled by increasing the relative speed of the 
memory response logic. In this case k=3 rather 
than k=1. 


FIGURE 12: 


In going to such high throughput systems, 
however, there would be requirements for commen- 
surably larger numbers of memory modules. Figure 
13, line A, illustrates the situation for ρ = 100 
with "square" multiprocessors (M=N) and k=1. The 
improvement in contention with increasing memory 
response speed can be seen in lines B and C respec- 
tively for k=2 and k=3. 

FIGURE 13: SAME NUMBER OF PROCESSORS AND MEMORIES 


Conclusions 


It has been shown that at least analytically 
there are no size limitations to conventional 
multiprocessing approaches which are beyond the 
current state of the art except control program 
efficiency. Hardware seems to be the only effec- 
tive way of significantly increasing this para- 
meter. Exploring methods of increasing hardware 
support for the control programs is therefore the 
most likely avenue to extending the limits for 
multiprocessor throughput performance. A com- 
panion paper discusses such an approach for which 
multiprocessor control can be made extremely 
efficient [1]. 


PARALLEL SYSTEMS WITH DYNAMIC STRUCTURE 

Amherst, Massachusetts 


Abstract. Contemporary complex software 
systems are frequently organized as collections of 
interacting parallel processes in which processes 
are created and destroyed or patterns of process 
interaction are altered during system execution. 
However, existing formal models of parallel compu- 
tation generally assume a fixed set of processes 
and/or fixed patterns of process interaction. 

This report discusses a new modelling scheme cap- 
able of representing parallel systems with dynamic 
structure. An example is presented illustrating 
the model's use in studying correctness of process 
interaction in complex software systems whose 
structure may change during system execution. 


Introduction 


An active area of interest in computer 
science within the last few years has been the 
area of software reliability. A great deal of 
effort has been directed toward discovering design 
methods and analysis techniques applicable to the 
production of correctly-working software systems. 
An important basis underlying much of this effort 
in software reliability has been work in the 
formal modelling of software systems. 


The research described here is aimed at the 
development of a formal modelling scheme applica- 
ble to one particular class of complex software 
systems. This class, which we refer to as systems 
with dynamic structure, consists of systems of 
parallel processes in which processes may be 
created and/or destroyed, and in which patterns 
of process interaction may vary, over the course 
of system execution. Before considering the 
modelling scheme for parallel systems with dynamic 
structure, we outline in the next two sections of 
this report some background and motivation for 
this work. 


Complex Software Systems 


It is possible for any sequential computer 
program to be arbitrarily complicated if it is 
sufficiently large or poorly designed. However, 
the term "complex software system" as used here 
refers to a large system in which concurrent or 
parallel activity is a significant aspect of 
system behavior. Most modern operating 
systems or data base management systems would be 
considered complex software systems according to 
these criteria. Such complex software systems 
have the additional attribute of being difficult 
to understand due to their large size and con- 
current activity, and are therefore also difficult 
to design and analyze. 


To facilitate the understanding of any 
complex system, the standard approach as discuss- 
ed by Simon [21] is to decompose the system into 
an organized collection of smaller, simpler parts. 
Hopefully the system can then be understood in 
terms of these smaller, simpler parts and the 
interactions among them. In the specific case 
of complex software systems, this approach 
translates into decomposition into processes [6] 
or modules [16]. The result may be referred to 
as a system of cooperating sequential processes 
[4] or a system of interacting parallel processes. 
These equivalent terms emphasize different aspects 
of the resulting system. The former stresses the 
sequential nature of the individual processes 
whose cooperation results in the overall system 
behavior, while the latter underscores the fact 
that the component processes are operating in 
parallel, with their interaction determining the 
system's performance. 


Decomposition of a complex software system 
into a system of interacting parallel processes 
is a useful step toward understanding complex 
software systems, but it by no means completely 
solves the problem. The parts are indeed simpler, 
but the interactions among them remain quite 
complicated. This fact is witnessed by the 
appearance in the literature of incorrect solu- 
tions to comparatively simple interaction situa- 
tions (e.g., [7]). It is all the more difficult 
to attain understanding of a complex software 
system with dynamic structure, where processes 
may be created and/or destroyed and interprocess 
communication paths (process interaction patterns) 
may be altered during system execution. It is 
for this reason that formal modelling schemes may 
be useful in designing and analyzing complex 
software systems. 


Software Reliability and Formal Models 


In the domain of sequential programs, the 
utility of formal models as a foundation for 
software reliability work is quite evident. For 
example, the widely known structured programming 
method of Dijkstra [3], Wirth [24] and others, 
draws upon the formal work of Bohm and Jacopini 
[2]. Proof techniques applied to program correct- 
ness by Floyd [5], Manna [14] and others are 
closely related to the program schema model which 
was defined by Ianov [20]. 


In contrast to the situation regarding the 
domain of sequential programs, the realm of 
parallel programming (i.e., complex software 
systems) has yet to settle upon any accepted 


approaches to software reliability.(a) Nonethe- 
less, the influence of formal models can be seen 
here also. Campbell and Habermann's path expres- 
sion work has been closely tied to Petri nets [13]. 
Keller [11] has recently introduced a verification 
methodology for parallel systems based upon a 
formal model which he has defined, which is in 
turn an extension of Petri's. The DREAM system 
for design and analysis of complex systems [19, 
23], has as its basis Riddle's formal modelling 
scheme for complex software systems [18]. 


Given the evident utility of formal modelling 
schemes for work in software reliability, it 
would appear useful to have a formal model for 
parallel systems with dynamic structure. However, 
the existing formal models of parallel computation, 
such as Petri nets [17], parallel computation 
graphs [8, 1], parallel program schemata [9, 10], 
and PPML/MTEs [18] all apply only to essentially 
static structure situations and are therefore of 
little use in modelling parallel systems with 
dynamic structure. It is for this reason that we 
are developing a formal modelling scheme for 
parallel systems with dynamic structure. 


Modelling Parallel Systems with Dynamic 
Structure 


A formal model for parallel systems with 
dynamic structure attempts to represent systems 
whose structure changes over time. In developing 
such a modelling scheme, we have taken a particu- 
lar view of parallel systems with dynamic struc- 
ture. Since our main interest is in modelling 
complex software systems, we consider the systems 
which are to be modelled as being composed of 
interacting parallel processes. The modelling 
scheme focuses upon representing dynamic structure 
and process interaction; in particular, internal 
computations of the individual processes are 
abstracted (i.e., not explicitly represented) in 
the modelling scheme. Parallelism in the model 
is represented by an arbitrary interleaving of 
strings of indivisible events, a fairly standard 
technique. 


Two major components are required for a 
modelling scheme for parallel systems with 
dynamic structure. The scheme must provide a 
means for describing the possible processes which 
might appear (be instantiated) in the system 
during the course of its operation. Descriptions 
of potential processes must describe both their 
(potential) behavior and the possible communica- 
tion paths by which they may interact with other 
processes in the system. The scheme must also 
allow for the description of the system's configu- 
ration at a given time, and for describing changes 
in that configuration. A configuration descrip- 
tion should include a representation of the 
currently active (instantiated) processes, the 
current process interaction configuration and the 


current states of the individual processes and of 
interprocess communication. The remainder of 
this report outlines the major features of a 
modelling scheme for parallel systems with 
dynamic structure which includes both of these 
components and also presents a simple example 
illustrating the use of the modelling scheme. 
Complete details on the modelling scheme and 
additional examples of its use may be found in 
[22]. 

(a) Examples of the directions which have been 
proposed to date include [12], [15], [11] and 
[18]. 


Elements of a Modelling Scheme 


The behavior and communication possibilities 
for potential processes in the dynamic process 
modelling scheme are described using process 
templates. Each process template represents a 
class of potential processes; at any given time 
the modelled system may include zero or more 
instantiations of any particular process class. 
The process template is described using a simple 
abstract programming language called the Dynamic 
Modelling Language or DYMOL. DYMOL contains a 
sufficient set of control constructs, including 
conditional branching instructions which allow 
for testing the current system configuration, the 
last message received by the process or the 
result of some computation internal to the 
process. The latter is essentially a non-deter- 
ministic branch, since internal process computa- 
tion is not explicitly represented in the model- 
ling scheme. DYMOL also includes an instruction 
for setting the contents of the distinguished 
process storage location known as the buffer. 
Each process has, implicitly, a buffer which is 
the source for its message transmission to other 
processes and the sink for its message reception 
from other processes. In addition to these, 
DYMOL includes the six instructions listed below: 


CREATE <process class id> <process ref var> 
DESTROY <process id expr> 
ESTABLISH <port name> <port name> 
CLOSE <port name> <port name> 
SEND <port id> 
RECEIVE <port id> 


The first two commands, CREATE and DESTROY, 
bring about changes in the process structure of 
the modelled system. The CREATE command, when 
executed, causes a new process of the class 
specified by its first parameter to be added to 
the system. The unique process identifier 
assigned to the newly created process as part 
of the creation operation is returned to the 
creating process in the result variable specified 
as the second CREATE command parameter. A unique 
process identifier is formed according to the 
following BNF production: 


<process id> ::= <process class id> <integer> 


Integers are not allowed to appear in process 
class identifiers, which are specified as part 

of each process template. The DESTROY command, 
when executed, causes the specific process named 
by its parameter to be deleted from the system. 

A process identifier expression can be any of the 
three possibilities given by the following BNF 


production: 


<process id expr> ::= <process id> | 
<process ref var> | ME 


since any of these evaluates to a process identi- 
fier. 
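
The two productions above can be illustrated with a minimal sketch. It assumes that a process class identifier consists only of letters and underscores, so that the trailing integer can be split off unambiguously; the regular expression and the function names are hypothetical and are not part of DYMOL itself.

```python
import re

# Assumed lexical form: a class identifier without digits, then an integer.
_PROCESS_ID = re.compile(r"^([A-Za-z_]+)(\d+)$")


def parse_process_id(text):
    """Split a <process id> into (process class id, integer)."""
    match = _PROCESS_ID.match(text)
    if match is None:
        raise ValueError(f"not a process identifier: {text!r}")
    return match.group(1), int(match.group(2))


def evaluate(expr, executing_process, ref_vars):
    """Evaluate a <process id expr> to a process identifier.

    expr is one of: the keyword ME, a process reference variable,
    or a literal process identifier.
    """
    if expr == "ME":
        return executing_process
    if expr in ref_vars:
        return ref_vars[expr]
    parse_process_id(expr)  # raises ValueError if expr is malformed
    return expr


print(parse_process_id("TASK7"))                      # ('TASK', 7)
print(evaluate("TVAR", "SCHED1", {"TVAR": "TASK1"}))  # TASK1
print(evaluate("ME", "SCHED1", {}))                   # SCHED1
```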


The ESTABLISH and CLOSE commands are used to 
alter the interprocess communication configuration 
of the modelled system, while the SEND and RECEIVE 
commands are used for actual interprocess communi- 
cation. Each process template specifies a set of 
ports through which processes of the given process 
class may communicate with other processes in the 
system. These ports are implicitly declared by 
the <port id> parameters of the SEND and RECEIVE 
commands. Thus, each instantiation of a particu- 
lar process class has the same set of local port 
identifiers. The ESTABLISH command, when executed, 
causes the specified ports of two particular 
processes to be connected by a message buffering 
channel called a link. The particular ports which 
are to be connected are specified by their port 
names which are unique identifiers formed as 
indicated by the following BNF production: 

<port name> ::= <process id expr> <port id> 
The special DYMOL process identifier expression 
ME can be used to indicate that the port named is 
a port of the process executing the ESTABLISH 
command. The CLOSE command, when executed, causes 
the link connecting the two specified ports to be 
deleted from the system. The SEND command, when 
executed, transmits the message currently contain- 
ed in the process’ buffer out through the speci- 
fied port. Messages in the modelling scheme are 
represented by a finite set of message classes. 
Since the SEND command takes a port identifier 
for its parameter, a sending process need not and 
cannot specify the receiving process for the 
message. Any process connected to the port 
through which the message is sent may receive the 
transmitted message. (Of course, the sender may 
be able to control the destination of a message 
using the ESTABLISH command.) The RECEIVE 
command, when executed, causes a message from one 
of the links connected to the specified port to 
be brought into the receiver's buffer. The link 
from which the message is to be received is non- 
deterministically chosen from those currently 
containing messages, and the message received is 
non-deterministically chosen from among those 
currently in the selected link. If none of the 
links connected to the specified port currently 
contains a message, the process executing the 
RECEIVE is suspended until a message is available, 
at which time it may receive the message and 
continue execution. 
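
The non-deterministic choice made by RECEIVE can be sketched as follows. This is an illustration under stated assumptions rather than the scheme's formal semantics: each link is represented as a Python list of messages, the non-deterministic choices are simulated with a random number generator, and a return value of None stands for suspension of the receiving process. The function name is hypothetical.

```python
import random


def receive(links_at_port, rng):
    """Take one message from the links connected to a port.

    links_at_port: list of links, each a list of buffered messages.
    Returns the chosen message, or None if every link is empty
    (the process would be suspended until a message arrives).
    """
    nonempty = [link for link in links_at_port if link]
    if not nonempty:
        return None
    link = rng.choice(nonempty)        # choose among links holding messages
    index = rng.randrange(len(link))   # choose among messages in that link
    return link.pop(index)


rng = random.Random(0)
links = [[], ["SEM"]]
print(receive(links, rng))  # SEM
print(receive(links, rng))  # None
```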


Figures 1 and 2 are example process templates. 
The two process classes represented, SYNCH and 
TASK, are isomorphic at the level of abstraction 
of the modelling scheme. They will be distin- 
guished in the example to be presented below. 
The statement labels found in the DYMOL process 
template descriptions are not used for branching, 
but rather are needed for the specification of 
process state, as discussed below. DO FOREVER is 
an infinitely looping control construct, while 
BEGIN, END and the semicolon separator syntax 

are algolic. The graphical representation of 

the process templates is a useful device for 
presenting the models; note the distinction 
between inbound and outbound ports and also the 
boxes representing links. 


Having a means for describing potential 
processes, we now require a means for describing 
the system's configuration and changes to that 
configuration. In the dynamic process modelling 
scheme this is done primarily through use of the 
configuration matrix, C. C is a matrix whose 
indices are the set of unique process identifiers 
for active (instantiated) processes in the system. 
The entries of C describe the current inter- 
process communication configuration of the system, 
each entry containing a set of ordered pairs of 
port identifiers. The appearance of the ordered 
pair (x,y) in C(a,b) indicates that port x of 
process a is currently connected to port y of 
process b. Thus the configuration matrix 
summarizes the current configuration of the 
system at any point in time, its indices provid- 
ing a list of the active processes in the 
system and its contents indicating the current 
interprocess communication linkages. 


SYNCH: 
SY0: DO FOREVER 
     BEGIN 
SY1:   RECEIVE MUTEX_V; 
SY2:   SEND MUTEX_P 
     END. 
(graphic: inbound port MUTEX_V, outbound port MUTEX_P) 
Figure 1 

TASK: 
TA0: DO FOREVER 
     BEGIN 
TA1:   RECEIVE IN; 
TA2:   SEND OUT 
     END. 
(graphic: inbound port IN, outbound port OUT) 
Figure 2 


The configuration matrix C may be altered 
by execution of the DYMOL commands CREATE, 
DESTROY, ESTABLISH and CLOSE. In fact, it is 
these changes to C which define the semantics of 
the four DYMOL commands. Execution of the CREATE 
command causes a new row and column to be added 
to C, representing the newly created process. 
Thus, suppose that the current configuration 
matrix were: 


         SCHED1   SYNCH1 
SCHED1 
SYNCH1 

Then, execution of the DYMOL command: 


CREATE TASK TVAR 


might result in the following configuration matrix: 


         SCHED1   SYNCH1   TASK1 
SCHED1 
SYNCH1 
TASK1 


All the entries in the newly-added row and column 
are initially null, indicating that no inter- 
process communication connections yet exist 
between the newly-created process and any of the 
other processes. 


Execution of the ESTABLISH command causes an 
ordered pair to be added to an entry in the confi- 
guration matrix. Thus, if the next DYMOL command 
to be executed were to be either of the equivalent 
pair:(b) 


ESTABLISH SYNCH1.MUTEX_P TASK1.IN 
or 
ESTABLISH SYNCH1.MUTEX_P TVAR.IN 


the resulting configuration matrix would be:(c) 


         SCHED1   SYNCH1   TASK1 
SCHED1 
SYNCH1                     (MUTEX_P,IN) 
TASK1 


Thus, the new entry in C(SYNCH1,TASK1) indicates 
that a communication link exists between the two 
processes SYNCH1 and TASK1 through ports MUTEX_P 
and IN, respectively. This link provides a means 
for SYNCH1 to send messages to TASK1, but the null 
entry in C(TASK1,SYNCH1) indicates that TASK1 has 
no capability of sending messages to SYNCH1 at 
this point. 


(b) The two commands given are equivalent only due 
to the assumption that one of them is the next 
command executed, so that the process reference 
variable TVAR still contains the process iden- 
tifier TASK1. 

(c) To simplify notation, set brackets around 
singleton sets are omitted throughout this 
report. 


Execution of the CLOSE command causes an 
ordered pair to be deleted from an entry in the 
configuration matrix. Thus, if the next DYMOL 
command to be executed were to be either of the 
equivalent pair: 

CLOSE SYNCH1.MUTEX_P TASK1.IN 
or 
CLOSE SYNCH1.MUTEX_P TVAR.IN 


the resulting configuration matrix would be: 


         SCHED1   SYNCH1   TASK1 
SCHED1 
SYNCH1 
TASK1 


The new null entry in C(SYNCH1,TASK1) indicates 
that SYNCH1 can no longer send messages to TASK1. 
However, any messages which it may have already 
transmitted to TASK1 prior to execution of the 
CLOSE command remain available for TASK1 to 
receive at a later time. 


Execution of the DESTROY command causes a 
row and column to be deleted from the configura- 
tion matrix. (d) Thus, if the next DYMOL command 
to be executed were to be either of the equiva- 
lent pair: 


DESTROY TASK1 


or 


DESTROY TVAR 


the resulting configuration matrix would be: 


         SCHED1   SYNCH1 
SCHED1 
SYNCH1 
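
The effect of the four commands on the configuration matrix can be sketched as a small data structure. This is a minimal illustration, assuming a sparse representation in which null entries are simply not stored and DESTROY follows the simplified row-and-column deletion described above; the class and method names are hypothetical.

```python
class ConfigurationMatrix:
    """Sparse sketch of the configuration matrix C."""

    def __init__(self, processes=()):
        self.active = list(processes)  # indices of C (the active processes)
        self.entries = {}              # (a, b) -> set of (port of a, port of b)

    def create(self, process_id):
        # CREATE: add a new row and column, all entries initially null.
        if process_id in self.active:
            raise ValueError(f"{process_id} already exists")
        self.active.append(process_id)

    def establish(self, a, port_a, b, port_b):
        # ESTABLISH: add the ordered pair (port_a, port_b) to C(a, b).
        for p in (a, b):
            if p not in self.active:
                raise KeyError(f"{p} is not an active process")
        self.entries.setdefault((a, b), set()).add((port_a, port_b))

    def close(self, a, port_a, b, port_b):
        # CLOSE: delete the ordered pair from C(a, b).
        pairs = self.entries.get((a, b), set())
        pairs.discard((port_a, port_b))
        if not pairs:
            self.entries.pop((a, b), None)

    def destroy(self, process_id):
        # DESTROY (simplified): delete the row and column of the process.
        self.active.remove(process_id)
        self.entries = {k: v for k, v in self.entries.items()
                        if process_id not in k}

    def entry(self, a, b):
        return set(self.entries.get((a, b), ()))


c = ConfigurationMatrix(["SCHED1", "SYNCH1"])
c.create("TASK1")
c.establish("SYNCH1", "MUTEX_P", "TASK1", "IN")
print(c.entry("SYNCH1", "TASK1"))  # {('MUTEX_P', 'IN')}
print(c.entry("TASK1", "SYNCH1"))  # set()
c.destroy("TASK1")
print(c.active)                    # ['SCHED1', 'SYNCH1']
```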


(d) This is a slight simplification of the actual 
semantics for DESTROY as given in [22]. 


Given the modelling scheme outlined above, 
a model in the scheme consists of a set of 
process templates, an initial instantaneous 
configuration (the notion of instantaneous con- 
figuration will be defined shortly) and (option- 
ally) a set of terminal instantaneous configurations. 
More formally, a model M = (P,x<0>,T) where: 


P = a finite set of process templates 
defined by abstract programs in the 
Dynamic Modelling Language 

x<0> = an instantaneous configuration, call- 
ed the initial instantaneous confi- 
guration, and 

T = an optional set of instantaneous configu- 
rations, called the terminal instan- 
taneous configurations set 


Thus to model a parallel system with dynamic 
structure using the dynamic process modelling 
scheme, one would describe the set of potentially 
active processes using the abstract programming 
language, specify an initial system configuration 
and perhaps specify any terminal configurations of 
interest (e.g., a configuration representing a 
system deadlock). The modelling scheme can thus 
be used in a purely descriptive fashion, which 
could prove useful in the design of complex 
software systems with dynamic structure. 


An instantaneous configuration for a model M 
is a complete description of the model at some 
instant of time during its execution. Thus, an 
instantaneous configuration must specify the 
current set of active processes and the state 
(i.e., location counter values and message buffer 
contents) of each, the current process interaction 
configuration and the current state of each link 
currently in the model. In formal terms(e), an 
instantaneous configuration x = (C,A,Q,L) where: 


C = current configuration matrix 

A = current set of active processes (i.e., the 
indices of C) 

Q = current state of processes in A (i.e., 
location counter and message buffers of 
processes in A) 

L = current state of links 


Using the notion of instantaneous configuration, 
it is possible to define a dynamic process 
computation step y<i,j> = x<i>x<j> where x<i> and 
x<j> are instantaneous configurations for a model 
M and x<j> is the result of the occurrence of a 
single primitive operation (e.g., execution of 
one process command or transfer of one message) 
with M in instantaneous configuration x<i>. Then 
a computation for a model M may be defined as 
z = y<0,1>·y<1,2>· ... ·y<t-1,t> = 
x<0>x<1>x<2>...x<t-1>x<t>, i.e., a series of 
computation steps beginning with the initial 
instantaneous configuration of M and ending with 
a terminal instantaneous configuration (i.e., 
x<t> ∈ T) or with any instantaneous configuration 
of M if T is not specified. Finally, the set of 
all possible computations for M may be defined 
as Z<M> = {z | z is a computation for M}. 
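
The definition of a computation can be made concrete with a bounded enumeration. The sketch below is an assumption-laden illustration: configurations are arbitrary Python values, the step relation is supplied as a hypothetical successors function, and the search is cut off at max_steps because the full set of computations may be infinite. The toy step relation on integers stands in for a real model.

```python
def computations(x0, successors, is_terminal=None, max_steps=3):
    """Yield computations x<0>, x<1>, ..., x<t> with 1 <= t <= max_steps.

    successors(x) returns the configurations reachable from x in one
    computation step. If is_terminal is given, only sequences ending in
    a terminal configuration are yielded; otherwise every sequence is.
    """
    stack = [(x0,)]
    while stack:
        path = stack.pop()
        ends_ok = is_terminal is None or is_terminal(path[-1])
        if len(path) > 1 and ends_ok:
            yield path
        if len(path) - 1 < max_steps:
            for nxt in successors(path[-1]):
                stack.append(path + (nxt,))


def toy_successors(n):
    # Toy step relation: from n, one step leads to n + 1 or n + 2.
    return [n + 1, n + 2]


found = sorted(computations(0, toy_successors, lambda n: n == 3))
print(found)  # [(0, 1, 2, 3), (0, 1, 3), (0, 2, 3)]
```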


(e) This definition and those following it in the 
remainder of this section are somewhat simpli- 
fied versions of the definitions given in [22]. 


A Simple Example 


The use of the dynamic process modelling 
scheme and an indication of its potential utility 
for software reliability work can be illustrated 
by the following simple example. The scheme is 
used to model a set of tasks, created and destroy- 
ed by a scheduler, all accessing a shared resource 
requiring exclusive access. An example of such 
a situation might be an on-line reservation system 
composed of a number of reservation-making tasks 
accessing a common data base, where the number of 
tasks varies with system load. 


The process templates for the tasks and for 
the process which will synchronize the tasks' 
access to the shared resource were presented in 
figures 2 and 1, respectively. Figure 3 is the 
template for the scheduler process, which alter- 
nately creates new tasks, connecting them to the 
synchronizer, and destroys tasks. The WHILE 
INTERNAL TEST construct models the unelaborated 
internal process computation which determines 
when tasks are to be created or destroyed, repre- 
sented in the modelling scheme as non-determinis- 
tic, indefinite iterations. The FOR SOME 
construct selects a value for TVAR; it is 
evident that the resulting value must be a process 
identifier for a TASK process. 


Given these three process templates and an 
initial configuration matrix, C<0>, of the 
following form: 


         SCHED1   SYNCH1 
SCHED1 
SYNCH1 


then the model M may be represented formally and 
graphically as shown in figure 4. By convention, 
the entries in the current process state tuple, Q, 
are presented in the same order in which the pro- 
cesses are listed in A. Each entry is itself a 
tuple in which the current process location 
counter value, expressed by statement label, 
appears first, the current buffer contents appear 
last, and the values of any internal process 
variables appear in keyworded form between these 
two. Note the message SEM which is initially 
located in the link connected to the MUTEX_P port 
of SYNCH1. 


SCHED: 
SC0: WHILE INTERNAL TEST DO 
     BEGIN 
SC1:   WHILE INTERNAL TEST DO 
       BEGIN 
         CREATE TASK TVAR; 
         ESTABLISH TVAR.OUT SYNCH1.MUTEX_V; 
         ESTABLISH SYNCH1.MUTEX_P TVAR.IN 
       END; 
SC2:   WHILE INTERNAL TEST DO 
       BEGIN 
         FOR SOME TVAR ∈ A - {SCHED1,SYNCH1} DO 
         BEGIN 
           DESTROY TVAR 
         END 
       END 
     END. 


Figure 3 


M = (P, x<0>, T) 
where 
P = {SCHED, SYNCH, TASK} 
x<0> = (C<0>, A<0>, Q<0>, L<0>) 
T = {(C,A,Q,L) | [condition illegible in source]} 
A<0> = {SCHED1, SYNCH1} 
Q<0> = ((SC0, TVAR = ∅, ∅), (SY0, ∅)) 
L<0> = (<SEM>) 
Figure 4 
A possible instantaneous configuration, x<i>, 
for M(f) is presented formally in figure 5 and 


graphically in figure 6. Three TASK processes are 
currently in the system, each connected to the 
SYNCH1 process. Some other TASK processes may 
have been destroyed by SCHED1, although this can- 
not be determined from the present configuration 
since (aside from those which the modeller assigns 
to processes present in the initial instantaneous 
configuration) unique process identifiers are 
generated randomly as part of the creation 
operation and do not necessarily appear in 
numerical order. 


A possible computation step from x<i> is 
shown in figure 7. Only the process state tuple, 
Q, and the link state tuple, L, are changed as a 
result of this computation step, which results in 
the message SEM being transmitted back to the 
SYNCH1 process. This may be viewed as the 
relinquishing of the shared resource by TASK7. 


An alternative possible computation step 
from x<i> is shown in figure 8. In this instance, 
SCHED1 has destroyed the task process TASK7, as 
evidenced by the appropriate changes to C, A, 
and Q. The new instantaneous configuration, 
x<i+1>, resulting from this computation step is an
element of the set of terminal instantaneous 
configurations, T, which corresponds to the set of 
instantaneous configurations yielding task dead- 
lock. While the overall system is still capable 
of executing additional computation steps, inclu- 
ding the creation and destruction of tasks, no 
task process can possibly progress since the SEM 
message was destroyed with TASK7 and the modelled 
system is incapable of generating any new messages.
Thus, all task processes will wait indefinitely
at their RECEIVE instructions (TA1). The
appearance of a computation resulting in task
deadlock would most likely indicate to a software
system designer that the modelled system
corresponds to an incorrect design. The example
thus suggests the potential utility of the
dynamic process modelling scheme as a software
design tool.

(†) One of the numerous sequences of computation
steps for M which could lead to x<i> is
enumerated in [22].


Finally, a corrected version of the model M',
based upon the revised scheduling process SCHED' 
illustrated in figure 9, is presented in figure 
10. The model M' incorporates the necessary 
synchronization of the scheduler and the task 
processes to prevent the destruction of a task 
which is currently holding the SEM message (i.e., 
a task which is in its critical region). For the 
model M', there is no computation z which results 
in an instantaneous configuration x<t> in T', and 
thus the revised model corresponds to a design 
which is free from task deadlock. (A proof of 
this assertion may be found in [22].) 
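
The deadlock argument can be replayed on a toy state model. The following Python sketch is illustrative only and is not the modelling scheme itself: the class `MutexModel` and its method names are hypothetical, it tracks only the set of tasks and the location of the single SEM message, and it assumes (as one reading of figure 9) that the revised scheduler holds SEM itself between RECEIVE FREEZE and SEND THAW.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class MutexModel:
    """Toy state: existing tasks and the location of the single SEM message."""
    tasks: Set[str] = field(default_factory=set)
    holder: Optional[str] = None   # task currently holding SEM, if any
    sem_at_synch: bool = True      # True while SEM is available at SYNCH1

    def create(self, task: str) -> None:
        self.tasks.add(task)

    def acquire(self, task: str) -> bool:
        # A task passes its RECEIVE only if SEM is available at SYNCH1.
        if task in self.tasks and self.sem_at_synch:
            self.sem_at_synch = False
            self.holder = task
            return True
        return False

    def release(self, task: str) -> None:
        if self.holder == task:
            self.holder = None
            self.sem_at_synch = True

    def destroy_unsafe(self, task: str) -> None:
        # Original scheduler: destroys any task, even the SEM holder.
        self.tasks.discard(task)
        if self.holder == task:
            self.holder = None     # SEM is destroyed together with the task

    def destroy_safe(self, task: str) -> bool:
        # Revised scheduler (assumed reading): take SEM, destroy, give SEM back.
        if not self.sem_at_synch:
            return False           # scheduler would wait at RECEIVE FREEZE
        self.sem_at_synch = False  # RECEIVE FREEZE
        self.tasks.discard(task)   # DESTROY TVAR
        self.sem_at_synch = True   # SEND THAW
        return True

    def task_deadlock(self) -> bool:
        # SEM exists nowhere, so no task can ever pass its RECEIVE again.
        return not self.sem_at_synch and self.holder is None
```

Destroying the holder with `destroy_unsafe` leaves `task_deadlock()` true, whereas `destroy_safe` refuses to act while a task is in its critical region, so the SEM message is never lost.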


In Conclusion 


The dynamic process modelling scheme which 
we have described here was developed as a basis 
for formulating design methods and analysis 
techniques applicable to complex software systems 
with dynamic structure. To that end, we are 
currently incorporating its concepts and 
constructs into the DREAM software design aid 
system. The modelling scheme has also been used 
to investigate decidability issues for dynamical- 
ly-structured parallel systems, to define and 
study subclasses of these systems, and as a 
vehicle for considering the necessity and modelling
power of various representational constructs.
At the same time, a non-procedural, expression-
based behavioral representation technique, called
constrained expressions, has been developed and
an effective procedure has been defined for
deriving constrained expressions from certain
subclasses of parallel systems with dynamic
structure. It is hoped that continued research
along these various dimensions will lead to an 
improved understanding of parallel systems with 
dynamic structure and to genuinely useful tools 
for designers of complex, dynamically-structured 
software systems. 
x<i> = (C<i>, A<i>, Q<i>, L<i>)


where 


SCHEDL SYNCH1 TASK2 TASK7 TASK10 


SCHED1 


SYNCH1 d 
C<i> = 
TASK2 tf) 
TASK10 (OUT, MUTEX_V) 


A<i> = {SCHED1, SYNCH1, TASK2, TASK7, TASK10}
Q<i> = ((SC1, TVAR=TASK7, ∅), (SY1, ∅), (TA1, ∅), (TA2, SEM), (TA1, ∅))
L<i> = (∅, ∅, ∅, ∅)
Figure 5 
Figure 6
y<i, i+1> = x<i>x<i+1>
where
x<i+1> = (C<i>, A<i>, Q<i+1>, L<i+1>)
Q<i+1> = ((SC1, TVAR=TASK7, ∅), (SY1, ∅), (TA1, ∅), (TA1, ∅), (TA1, ∅))
L<i+1> = (∅, ∅, <SEM>, ∅)
Figure 7
y<i, i+1> = x<i>x<i+1>
where
x<i+1> = (C<i+1>, A<i+1>, Q<i+1>, L<i>)

          SCHED1 SYNCH1 TASK2 TASK10
C<i+1> =
A<i+1> = {SCHED1, SYNCH1, TASK2, TASK10}
Q<i+1> = ((SC2, TVAR = TASK7, ∅), (SY1, ∅), (TA1, ∅), (TA1, ∅))
Figure 8
SCHED':
SC0: WHILE INTERNAL TEST DO
       BEGIN
         WHILE INTERNAL TEST DO
           BEGIN
             CREATE TASK TVAR;
             ESTABLISH TVAR.OUT SYNCH1.MUTEX_V;
             ESTABLISH SYNCH1.MUTEX_P TVAR.IN
           END;
SC2:     WHILE INTERNAL TEST DO
           BEGIN
             FOR SOME TVAR ∈ A - {SCHED1, SYNCH1} DO
               BEGIN
                 RECEIVE FREEZE;
                 DESTROY TVAR;
                 SEND THAW
               END
           END
       END.

Ports: THAW, FREEZE
Figure 9
M' = (P', x<0>', T') 
where 
P' = {SCHED', SYNCH, TASK} 
x<0>' = (C<0>', A<0>', Q<0>, L<0>)
t= =(— = ( a= 7 =(-— = eee 
SCHED'1 SYNCHL MUTEX P 
' — 
ev MUTEX V 
SYNCHL (MUTEX P, FREEZE) 
A<O>' = {SCHED'1, SYNCH1} 


Figure 10 
A Generalized Cluster Structure for Large Multi-microcomputer Systems *


Shyue B. Wu and Ming T. Liu

Summary


This paper presents a generalized cluster
structure for interconnecting a large number of
microcomputers, where each microcomputer
(microprocessor plus memory) constitutes a node
of the cluster. It turns out that four of the
popular cluster structures (viz., hypercube [4],
hierarchy [2], star [1] and tree [3]) are all
special cases of the generalized cluster
structure characterized by different parameter
values. Through the use of the generalized
cluster structure, a unique tool for analyzing a
variety of interconnection structures is also
discussed.


The generalized cluster structure consists
of three components and can be completely
described by two functions. The three components
are the computation node, the switching node and
the path; and the two functions are the
interconnection function and the switching
function.


A computation node (CN) is a microcomputer
which contributes computing power to the system.
A switching node (SN) is a microcomputer which
performs switching functions. A path is a medium
over which messages are passed. The medium may
be a dedicated link or a time-shared bus. It is
viewed as a bus (BUS) in this paper. An
interconnection function (IF) specifies a way of
interconnecting buses. Hence it characterizes a
topological structure. A switching function (SF)
can be circuit switching, message switching or
packet switching.


Let N, I, S, B, E, F, G, M and L be the
parameters of a generalized cluster structure,
all taking on integer values except for M and L.
Then the whole system is organized into N levels
of subclusters (N ≥ 1). Several elements of
each component (I, S, B elements respectively for
CN, SN, BUS) are initially interconnected to form
a basic subcluster. Then M subclusters at the
same level and several additional elements of
each component (E, F, G elements respectively for
CN, SN, BUS) are interconnected to form a
subcluster or portion thereof (L) at one higher
level.
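
The level-by-level construction suggests a simple recurrence for the number of elements of each component. The Python sketch below is one possible reading and not a formula taken from the paper: the function name `component_counts` is hypothetical, M is assumed to be an integer, and L is assumed to be 1 (a whole subcluster is formed at each level), so it does not cover fractional cases such as the star structure.

```python
def component_counts(levels, I, S, B, E, F, G, M):
    """Count (CNs, SNs, buses) in a cluster with the given number of levels.

    Assumed recurrence: a basic subcluster holds I CNs, S SNs and B buses;
    each higher level joins M subclusters and adds E CNs, F SNs and G buses.
    """
    if levels < 1:
        raise ValueError("a cluster needs at least one level")
    cn, sn, bus = I, S, B
    for _ in range(levels - 1):
        cn, sn, bus = M * cn + E, M * sn + F, M * bus + G
    return cn, sn, bus


# Parameter pattern I=K, E=0, S=1, F=1, B=1, G=1, M=K with K = 2, three levels.
print(component_counts(3, I=2, S=1, B=1, E=0, F=1, G=1, M=2))
```

With K = 2 and three levels this gives 8 computation nodes and 7 switching nodes, which is what one would expect of a binary tree with 8 leaves.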

Formally, a generalized N-level cluster
structure is a 5-tuple, (CN, SN, BUS, IF, SF),
where, for i = 1, 2, ..., N,

CN  : A set of CNs = [Ci]
SN  : A set of SNs = [Si]
BUS : A set of level-i buses
IF  : A set of interconnection functions = [Bi]
      Bi = specification of a level-i bus
         = f(Ci, Si) ∈ [Ci, Si] ⊂ [Cn, Sn]
      |Bi| = no. of nodes on a level-i bus
SF  : A set of switching functions

Using different parameter values of the
generalized cluster structure, we can specify
different structures. The parameter values of
four special cluster structures (viz., hypercube,
hierarchy, star and tree) are tabulated in Table
1.


A, Components 


CN SN BUS MEMBER 
I E S F B G M L 
l. K 0 O* K 1 iL K l 
2. 1+K 1 tS 1 K K K 1 
3% K 1 O* 1 id 2/K (K-1)/K 1/K 
4, K. 0 1 1 1 1 K 1 
B. Interconnection Functions 

Bl 1B11 Bi+l 1Bi+ll 
l. I K Ci K 
26 I 2 Ci,F 2. 
Bs I K Ci K 
4, I,s 1+K Si,F 1+K 


1 = Hypercube; 2 = Hierarchy; 3 = Star; 4 = Tree 

* One of I CNs takes care of the control and
communication over each of B buses.

$ One of I CNs is viewed as an SN to take care
of path control communication and switching
function over K buses.


Table 1: Parameter Values of 4 Special Structures 


Linear complexity and easy extensibility are
two important characteristics of the generalized
cluster structure. That is, the complexity (thus
the cost) of the system increases linearly with
the power of the system (or the number of
computation nodes) and the system can be easily
extended to an arbitrary number of nodes.


The generalized cluster structure can cover
a variety of interconnection structures. A
unique tool for analyzing and comparing different
structures has been obtained [5]. Traffic
congestion and message delay have been two
serious problems in all structures supporting
indirect interprocessor communication. Traffic
congestion can occur when a node or a path is
overloaded, whereas message delay is a result of
indirect interprocessor communication.


An analytical model based on the probability
of utilization is proposed to consider traffic
congestion and message delay problems. This
model accepts a given set of interprocessor
communication traffic and outputs a set of
analysis results. The intra-cluster
communication traffic is assumed to be equally
distributed and the inter-cluster communication
traffic is assumed to be symmetrically
distributed. The interprocessor communication
traffic is expressed by the degree of local
cooperation, which is defined to be the frequency
with which a node issues a message to another node
at the same level of subcluster.


Traffic congestion is determined by the
probability of utilization of system components.
The theoretical upper bound of how much
interprocessor communication traffic a structure
can support is the maximum interprocessor
communication traffic without having theoretical
traffic congestion. The maximum interprocessor
communication traffic a structure can have is
expressed by the minimum degree of local
cooperation. This optimal value (called OPTLC)
depends on interconnection structure and system
size. Figure 1 shows the OPTLC values vs
different system sizes of the four popular
cluster structures.

Message delay is measured by the number of
buses needed for passing a message. It depends
on which level of subcluster the source and
destination nodes of the message belong to, that
is, the interconnection distance between the
source and the destination. Average message
delay is the average value of all message delays.
It depends on the quantity and the distribution
of message traffic. Given the optimal message
traffic derived from Figure 1, the average
message delay (OPTDAVG) vs different system sizes
of the four popular cluster structures is
plotted in Figure 2.
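
As a small illustration of the delay measure, the average can be computed as a traffic-weighted mean of bus counts. This Python sketch is only an assumption about how such an average might be formed; the function name `average_message_delay` and the input layout (a mapping from number of buses traversed to number of messages) are hypothetical and do not come from the paper.

```python
def average_message_delay(messages_by_bus_count):
    """Traffic-weighted mean number of buses traversed per message.

    messages_by_bus_count maps a bus count (the delay of one message)
    to the number of messages that experience that delay.
    """
    total = sum(messages_by_bus_count.values())
    if total == 0:
        raise ValueError("no message traffic given")
    weighted = sum(buses * count for buses, count in messages_by_bus_count.items())
    return weighted / total


# Six local messages cross one bus each; two remote messages cross three buses.
print(average_message_delay({1: 6, 3: 2}))
```

The example makes the dependence on traffic distribution visible: shifting messages from the three-bus class to the one-bus class lowers the average.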
[Plot: curves labelled HIERARCHY, STAR and TREE; horizontal axis LOG10 (CN), from 2.35 to 4.15]

Figure 1 : OPTLC vs System Sizes
The lower the OPTLC value is, the better the
structure is, since the system will have a
greater degree of extensibility. OPTDAVG tells
us the average message delay, and the smaller the
OPTDAVG is, the better the structure is. These
two values could give us an idea about system
performance and application restriction. In
interactive and real-time applications, these
could be very helpful.

The generalized cluster structure has been
demonstrated as a useful tool for interconnection
system design. Selecting a particular
interconnection structure depends on system
applications and restrictions. The results from
our analysis are qualitative rather than
quantitative; thus one must be careful in
interpreting the results.
PARALLEL TRANSITION MACHINES

Abstract -- The architecture for a general
purpose parallel computer called a Parallel Tran- 
sition Machine is derived from an abstract theo- 
retical model of parallel computation. Its devel- 
opment is pursued to a design description involv- 
ing hardware block diagrams. Applicability across 
a broad spectrum of computational requirements is 
suggested by the modular extendability of the 
hardware units. It is apparent that computational 
problems can be broken down in a design hierarchy 
where each level executes in a virtual Transition 
Machine dynamically assignable to a free hardware 
unit. 


Introduction 


There are completely general models of paral- 
lel computation [8], but there are currently no 
machine architectures suitable for efficient exe- 
cution of parallel programs generated in accord- 
ance with one of these models. The parallel com- 
puter architectures which do exist are special 
purpose devices appropriate only within a re- 
stricted domain of any general parallel computa- 
tional model. For example, the restricted homo- 
geneous parallelism afforded by an associative 
or an array processor can only perform identical 
operations on multiple data sets concurrently. 


This paper describes a family of computa- 
tional machine architectures which implements a 
general model of parallelism. The conceptual 
model of parallel computation described by 
Keller [8] under the nomenclature of transition 
systems has been accepted here as the basis for a 
more detailed model of a machine architecture. 
This Parallel Transition Machine model provides 
a machine architecture in which transition systems 
can be executed. The details of this architecture 
are defined to a level where development can pro- 
ceed. A prototype system is currently under de- 
velopment; some of the detail design issues ad- 
dressed by this development are discussed in 
reference [3].


FIGURE 1: PARALLEL TRANSITION MACHINE


The architecture can be characterized as a
multiprocessor with a separate System Controller
as shown in Figure 1. Interrupts, I/O controllers,
and other special purpose processors can be
integrated into this model of Transition Machines, 
and introduce no significant developmental pro- 
blems. The System Controller is in essence a 
functional equivalent (implemented in hardware) 

of a multiprocessor executive for transition sys- 
tems. The operation of the System Controller is 
effected by a series of logical operations on 
fixed data constructs descriptive of the condi- 
tions under which the various computations become 
eligible. It effects the system transitions by 
performing matrix operations on a system status 
vector to obtain a procedure eligibility vector. 
The procedure eligibility vector provides a basis 
for task assignments to the processors; completion 
of the assignments results in a modified system 
status vector. 


The development cost of System Controllers 
is small relative to the cost of the multiproc- 
essors which are controlled by them. System tran- 
sitions can be effected in a fraction of the time 
that is currently required for straightforward 
software multiprogramming executives. This sup- 
ports an approximately linear extendibility of 
throughput in large array multiprocessors. To 
implement the equivalent system control matrix 
operations in a software multiprocessing executive 
would be infeasible due to the high overhead as 
shown in reference [3]. 


The application of Parallel Transition Ma- 
chines to large systems is extremely promising, 
and the feasibility of configuring arrays of 
coordinated microprocessors seems evident [14]. 
But large systems introduce commensurate chal- 
lenges; for example, an operating system and link- 
age editor of considerable complexity are required 


to implement the overlaying of partitioned matrices. 


Parallel Transition Machines also require 
programming structures which are not traditional. 
These non-traditional structures provide the ad- 
vantages of highly structured programs which re- 
sult in enhanced software productivity [1]. It 
is possible, however, to develop only the trans- 
lator for a suitable existing compiler so that 
traditional program structures could be trans- 
lated to run on Parallel Transition Machines. 


Abstract Parallel Computation Model 


Parallel Transition Machines are based on a
particular abstract model of parallel computation,
selected because of its generality. The model is
that of transition systems $(Q, \rightarrow)$, where
$Q$ is the set of possible system states and
$\rightarrow$ is the set of transitions between
states as described by Keller. A named transition
system is a triple $(Q, \rightarrow, \Sigma)$.
The components correspond respectively to the set
of possible system states $(q_1, q_2, q_3, \ldots)$,
a set of transitions between states
$(\rightarrow_1, \rightarrow_2, \rightarrow_3, \ldots)$,
and a set of names $(\sigma_1, \sigma_2, \ldots)$
associated with groups of individually programmed
transitions between states [8]. Since there is a
one-to-one correspondence between the indices on
sigma and the names themselves, the indices will be
used to indicate the names: $i$ implies $\sigma_i$,
and $I = \{i\}$ implies $\Sigma$. The index
$i \in I$ is associated with a group of system
transitions described by the statement:


when $R_i(\xi)$ do $\xi' = \psi_i(\xi)$


The symbols in this statement are defined as
follows:


$i$ = the index of the group of transitions
whose common feature is that they all
result in the data transformation
indicated by the function $\psi_i$.


$\xi$ = the set of all data items in the system.


$R_i(\xi)$ = the subset of satisfied propositions
on the data set $\xi$ which are essential
to defining the appropriateness, and
therefore constitute the enabling
predicate, for transitioning as determined
by performing the data transformation
$\psi_i(\xi)$.


$\psi_i(\xi)$ = the programmed functional data
transformation, associated with the group
of system transitions indicated by $i$,
which operates on the data set $\xi$ and
results in a revised data set $\xi'$.


The group $i$ can be associated with a procedure
(including preamble) that can be written by
a programmer to effect the data transformation,
$\psi_i$, on the data set $\xi$ when the appropriate
set of conditions $R_i$ is satisfied on that data set.
(Although obviously not the intent in Keller's
work, it has been demonstrated that program
requirements can be implemented to advantage in
this manner [1].) In a parallel computation step,
multiple sets of conditions, $R_i$, can be satisfied
simultaneously such that multiple transitions can
proceed in parallel. The $R_i$ are enabling
predicates that indicate the requisite status of
propositions on the data set $\xi$ which properly
enable the function $\psi_i$. Relevant propositions
that have been defined on data elements
$e_j \in \xi$ are the following:

1. the data element $e_j$ is available/not
available for use in subsequent computations,

2. the data element $e_j$ satisfies/does not
satisfy a specified condition relative to some
constant or other data element $e_k$ (for example,
$e_j < e_k$), and

3. the data element $e_j$ can/cannot be updated.
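
The "when ... do ..." statement can be made concrete with a small Python sketch. It is a minimal illustration under stated assumptions and not the architecture described in this paper: the data set is modelled as a dictionary, each named transition is a pair (enabling predicate, data transformation), and the names `enabled`, `fire` and the two example transitions are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Data = Dict[str, int]                      # stands in for the data set
Predicate = Callable[[Data], bool]         # enabling predicate of one group
Transform = Callable[[Data], Data]         # data transformation of one group
Transitions = Dict[str, Tuple[Predicate, Transform]]


def enabled(transitions: Transitions, xi: Data) -> List[str]:
    """Names of all transitions whose enabling predicate holds on xi."""
    return [name for name, (pred, _) in transitions.items() if pred(xi)]


def fire(transitions: Transitions, name: str, xi: Data) -> Data:
    """Apply one enabled transition and return the revised data set."""
    pred, transform = transitions[name]
    if not pred(xi):
        raise ValueError(f"transition {name!r} is not enabled")
    return transform(dict(xi))             # work on a copy of the data set


# Two hypothetical transitions over data items "a" and "b".
EXAMPLE: Transitions = {
    "copy": (lambda xi: xi["b"] == 0, lambda xi: {**xi, "b": xi["a"]}),
    "halve": (lambda xi: xi["a"] % 2 == 0, lambda xi: {**xi, "a": xi["a"] // 2}),
}
```

When `enabled` returns more than one name, those transitions are candidates for proceeding in parallel, provided they do not conflict on the data elements they read and write.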


This paper deals exclusively with valid par- 
allel programs; these programs will not exhibit 
race conditions, and therefore procedures which 
read and write the same data element will have a 
predetermined execution order specified by their 
respective enabling predicates. The properties of 
determinacy, commutativity and persistence are 
described by Keller [9]. These and other proper- 
ties of valid parallel programs are also discussed 
in references [5], [6], [8], and [11]. 


Transition Machine Model


As an organizational basis for implementing 
the architectural model of parallel computation, 
we have defined a set of constructs and the logical 
matrix operations on these constructs which effect 
the system control functions for a Parallel Tran- 
sition Machine. The constructs and logic are ex- 
emplified in Figure 2. 
Eligibility Determination


Definition 1. The system status vector, S, is
a set of binary status indications for a set of
propositions concerning the data set $\xi$ such that
for every possible proposition on the set there is
an associated status indication, $s_j$ in S, if and
only if the proposition on the data set is relevant
to enabling some procedure in the system. (In
hierarchical implementations discussed further on,
conditions which are relevant to every procedure
at a given level will also be excluded.) $s_j = 1$
if the associated proposition on the data set is
met, $s_j = 0$ otherwise.


For convenience we will use the phrase "data 
condition" in referring to a "proposition on the 
data set" throughout the remainder of this paper. 


Definition 2. The system eligibility vector,
E, is a set of binary status indications for the
set of predicates $R_i$, such that for each predicate
$R_i$ there is an associated status indication, $e_i$
in E, indicating whether $R_i$ is currently satisfied,
enabling the associated procedure. $e_i = 1$
indicates the associated predicate is satisfied;
$e_i = 0$ otherwise.


Definition 3. A data condition associated
with $s_j$ is relevant to enabling procedure i if and
only if the data condition whose status is
indicated by $s_j$ is included in the predicate $R_i$.


Proposition 1. The predicate $R_i$ can be
represented as a set of binary relevance indications
associated (and in conjunction) with each of the
data conditions whose status is maintained in S.
This proposition follows directly from the
previous definitions.


Definition 4. The relevance matrix, R, is
comprised of binary relevance indications, $r_{ij}$,
indicating the relevance of a data condition j to
enabling procedure i. Relevance is indicated by
$r_{ij} = 0$, irrelevance by $r_{ij} = 1$.

Definition 5. The logical dot product of a
matrix M (with dimension I x J) and a vector W
(of dimension J) is defined as the vector
$P = M \circ W$, with dimension I, where

$$P_i = \bigwedge_{j=1}^{J} (m_{ij} \vee w_j)$$

In this equation (and throughout this paper)
the following symbol definitions apply:

$$\bigwedge_{n=1}^{N} x_n = x_1 \wedge x_2 \wedge \cdots \wedge x_N$$

$\wedge$ = logical "AND",

$\vee$ = logical "OR".


Proposition 2. The system eligibility vector,
E, can be computed appropriate to a given state of
the system by generating the logical dot product
of the relevance matrix, R, and the system status
vector, S.


Proof:


From definition 5 it follows that:

$$[R \circ S]_i = \bigwedge_{j=1}^{J} (r_{ij} \vee s_j)$$

From definitions 4 and 1 it follows that
$r_{ij} \vee s_j = 1$ if and only if data condition j is
either met or irrelevant to enabling procedure i.
Then by proposition 1 it follows that
$[R \circ S]_i = 1$ if and only if all data conditions
of the predicate $R_i$ are satisfied. Thus,
$[R \circ S]_i = e_i$ by definition 2, and it is proved
that $E = R \circ S$ as proposed.
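
The eligibility computation can be written out directly from Definitions 4 and 5. The Python sketch below is illustrative only: the function name `logical_dot` and the example matrix are hypothetical, and plain lists of 0/1 integers stand in for the hardware vectors.

```python
from typing import List


def logical_dot(M: List[List[int]], W: List[int]) -> List[int]:
    """Logical dot product: P[i] is the AND over j of (M[i][j] OR W[j])."""
    for row in M:
        if len(row) != len(W):
            raise ValueError("every matrix row must have one entry per element of W")
    return [int(all(m or w for m, w in zip(row, W))) for row in M]


# Relevance matrix R: r[i][j] == 0 marks data condition j as relevant to
# procedure i, r[i][j] == 1 marks it as irrelevant (Definition 4).
R = [
    [0, 1, 0],   # procedure 0 needs data conditions 0 and 2
    [1, 0, 1],   # procedure 1 needs data condition 1 only
]
S = [1, 0, 1]    # data conditions 0 and 2 are met, condition 1 is not

E = logical_dot(R, S)
print(E)
```

Because irrelevance is encoded as 1, an irrelevant condition can never block a procedure: its term in the conjunction is always satisfied regardless of the status bit.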


There is now a prescription for determining 
procedure eligibilities based on system status and 
the procedures' data conditional requirements. 
What remains to be shown is the computation of the 
new system status vector appropriate to having 
completed a given procedure. 
System Status Update


Since Keller did not address actual imple- 
mentations of named transition systems, it was not 
incumbent upon his work to address representation 
and the associated maintenance of state. In this 
paper however, we posit that there are J data con- 
ditions (propositions concerning the data set) 
whose status (true or false) will provide suffi- 
cient information concerning the state of the sys- 
tem to effect any and all of the named transitions 
defined for the system. But the status indications 
for these data conditions maintained in the S vector
are not a part of the data set $\xi$ associated
with the transformations $\psi_i$. Therefore, they


must be updated separately in order that changes 
of state be reflected in the system status vector. 


There are several possible implications on 
the status of a data condition at the completion 
of a procedure which implements the data trans- 
formation. They are as follows: 


1. The data condition's status remains
unaffected by the procedure running to completion.

2. The data condition's status is satisfied
whenever the procedure runs to completion.

3. The data condition's status is negated
whenever the procedure runs to completion.

4. The data condition's status is determined
dynamically during the execution of the procedure.


The fixed constructs which are implemented to
effect system status modifications are described
below:


Definition 6. The jth element, $t_{ij}$, of the
true condition vector, $T_i$, is a binary status
indication associated with procedure i and the
data condition j such that $t_{ij} = 1$ implies the
data condition j is either satisfied or unchanged
by the completion of procedure i.


Definition 7. The jth element, $f_{ij}$, of the
false condition vector, $F_i$, is a binary status
indication associated with the procedure i and the
data condition j. The element $f_{ij} = 1$ implies
the data condition j is either negated or unchanged
by the completion of procedure i.


Definition 8. The variable condition update
vector, V, is a set of binary status indications
which can be set dynamically by a procedure
running in a sequential processor. The component
$v_j$ is set to 1 by the procedure to indicate that
data condition j is satisfied, or $v_j$ is set to 0 to
indicate data condition j is not satisfied. For
elements in S that are not to be dynamically
updated, the associated element in the V vector can
be set to either 0 or 1.


Proposition 3. The four possible implications
on change in system status following completion of
procedure i can be computed according to the
formula:

$$S_{NEW} = (S_{OLD} \wedge T_i) \vee (T_i \wedge \bar{F}_i) \vee (\bar{F}_i \wedge V)$$

where the bar indicates the logical NOT operation.

Proof: The proof follows directly from the
definitions of the associated vectors as shown in
Table I.


TABLE I SYSTEM STATUS UPDATE POSSIBILITIES

t_ij   f_ij   Implications to System Status
 1      1     unchanged
 1      0     set true
 0      1     set false
 0      0     set variably

It should be noted that there are many forms 
which definitions 6 through 8 and proposition 3 
could have taken. The expression which we have 
used has the advantage of restricting the range of 
V such that a procedure can dynamically modify 
only conditions for which it is authorized. 


Proposition 4. The range of V is restricted
such that V can modify only a specific subset of
the data conditions, j. This subset is determined
by $T_i$ and $F_i$ for procedure i such that $s_j$ is
determined by $v_j$ if and only if $t_{ij} = 0$ and
$f_{ij} = 0$.


Proof: The implied new values of $s_j$ for the
various values of $t_{ij}$ and $f_{ij}$ from proposition 3
are shown in Table I from which the proposition
follows directly.
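
The restriction on V amounts to a mask derived from the two fixed vectors of a procedure. A minimal Python sketch follows; the function name `variable_conditions` is hypothetical and 0/1 lists again stand in for the hardware vectors.

```python
from typing import List


def variable_conditions(T_i: List[int], F_i: List[int]) -> List[int]:
    """Indices j that procedure i may set dynamically: t_ij == 0 and f_ij == 0."""
    if len(T_i) != len(F_i):
        raise ValueError("T_i and F_i must have the same length")
    return [j for j, (t, f) in enumerate(zip(T_i, F_i)) if t == 0 and f == 0]


# Procedure i leaves condition 0 alone, sets 1 true, sets 2 false,
# and is authorized to decide conditions 3 and 4 at run time.
print(variable_conditions([1, 1, 0, 0, 0], [1, 0, 1, 0, 0]))
```

Any bit of V outside the returned index set has no effect on the new status, which is the authorization property discussed above.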


It should be noted that there are also implied 
modifications to system status at entry to a pro- 
cedure; these modifications are to prohibit the 
same transition from being attempted in other proc- 
essors by denying subsequent update access to 
relevant portions of $\xi$ when $\xi' = \psi_i(\xi)$ has been
initiated.


In order to accommodate exclusive data access, 
another construct must be added to negate avail- 
ability of data which is to be updated by a cur- 
rently activated procedure. The update is required 
to insure that read/write conflicts do not arise 
between procedures whose execution is not "indi- 
visible". A discussion of the concept and defini- 
tion of indivisibility can be found in references 
[9] and [11]. To implement this update, a vector 
$A_i$ has been defined which is associated with each


procedure, i to specify the status update implied 
on entry to that procedure. 


Definition 9. The vector $A_i$ is a set of
binary status conditions $a_{ij}$, where the index j is
associated with the data conditions whose status
is maintained in S. $a_{ij} = 1$ if and only if the
jth data condition is a mutually exclusive data
availability condition required at entry to
procedure i; $a_{ij} = 0$ otherwise.


Modifying the system status 


NEW Cotp’ 43 


prior to entry is sufficient to effect contempor- 
aneous access protection for procedure i. 


Proposition 5. 


vector according to the formula S 


The proof of this proposition follows immedi- 
ately from definitions 1, 4 and 9 and proposition 
2 if there are no procedures activated prior to 
activating procedure i which are affected by or 
affect these mutually exclusive data availability 
conditions. If such procedures are currently 
active, procedure i would not have become eligible. 
(Refer to Keller [9] for definitions of commuta- 
tivity and persistence as they relate to valid 
parallel programs.) 


Proposition 6. If $A_i$ is identical to the
ith row in R for all i, then all procedures with
any entry conditions in common must execute
sequentially.


The proof of this proposition follows as a 
special case of proposition 5. 


Proposition 7. Modifying the system status 


vector according to the formula SEW = Sorp © A, 


restores S to its original value. 


Proof: The proof of this proposition follows 
directly from definition 9 and proposition 5 if 
there are no changes to S between entry and exit 
of the ith procedure. When there are other pro- 
cedures initiated or terminated in the interval, 
the proof holds because no procedures can proceed 
in parallel if they are affected by or affect the 
same data availability condition covered by $A_i$.


(Refer to Keller [9] for definitions of commuta- 
tivity and persistence.) Therefore, for every 
condition for which a5; = Q there will have been 


no intermediate change to S, and the proof is 
completed.


Proposition 8. The change in system status 
following completion of procedure i can be com- 
puted according to the formula: 


= A F F,A 
yew orp’ Se tp ww Aty® BO eh YD 
The proof follows directly from the proofs 


of propositions 3 and 7. 


It has been shown in reference [14] that 
interrupts can be integrated into the model in a 
near conventional manner. Externally activated 
procedures are defined for them which can never 
become eligible based upon their $R_i$ vector, but


which have an associated system status update 
identical to internally activated procedures, when 


they exit. This updated system status will then 
activate appropriate interrupt processing pro- 
cedures. 


Procedure Activation 


The emphasis of the preceding definitions and 
propositions has been to create a basis for deter- 
mining the eligibility of the individual data 
transformations which comprise the computation, as 
well as to maintain a current system status vector. 
In effect we have a sufficient basis for the
determination of the "when $R_i(\xi)$". This does not,
however, include a sufficient basis for the activation
of the data transformations, i.e., the
"do $\xi' = \psi_i(\xi)$". As a basis for this
activation procedure, it will suffice to maintain a
triple of descriptors for each procedure
(READ_i, WRITE_i and EXECUTE_i) in
the System Controller. These descriptors designate
respectively the elements of the data base, $\xi$,
which are to be read, the elements of the data
base, $\xi'$, which are to be written, and the starting
address of the executable procedure, $\psi_i$, which


implements the data transformation. The appro- 
priate descriptor triple can be transferred to 

the interface registers to effect activation of the 
procedure in the requesting processor as shown 

in Figure 2. 
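
The activation step can be pictured as a table lookup followed by a register transfer. The Python sketch below is a simplified illustration, not the System Controller design: the class `DescriptorTriple`, the function `activate`, the dictionary standing in for the interface registers, and the rule of choosing the lowest-numbered eligible procedure are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class DescriptorTriple:
    read: int      # descriptor for the data items to be read
    write: int     # descriptor for the data items to be written
    execute: int   # starting address of the executable procedure


def activate(E: List[int], table: List[DescriptorTriple],
             interface: Dict[str, int]) -> Optional[int]:
    """Load the descriptors of the first eligible procedure into the
    interface registers and return its index, or None if none is eligible."""
    for i, eligible in enumerate(E):
        if eligible:
            triple = table[i]
            interface["READ"] = triple.read
            interface["WRITE"] = triple.write
            interface["EXECUTE"] = triple.execute
            return i
    return None


table = [DescriptorTriple(0x100, 0x200, 0x1000),
         DescriptorTriple(0x300, 0x400, 0x2000)]
registers: Dict[str, int] = {}
chosen = activate([0, 1], table, registers)
print(chosen, registers)
```

The requesting processor would then start executing at the EXECUTE value and address its data relative to the READ and WRITE values.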


The addressing structure of the application 
programs which implement the transformations is 
shown in Figure 3. The EXECUTE register value 


WRITE 


EXECUTE READ 


tainment in multilevel secure systems [15]. 


Processor Type Accommodations 


The architecture model has been extended to 
include heterogeneous processor types. This is 
effected by maintaining a processor type designation
for each procedure, TYPE_i. The procedure


eligibility determination includes an evaluation 
of the equivalence between the type of the re- 
questing processor and the type designation of the 
eligible procedure. Thus, if the defined proce- 
dure requires an I/O activity, an I/O controller 
would be specified as a requirement for the pro- 
cedure. Having incorporated this approach into 
the model allows procedures to specify a special 
processor type such as floating point processors, 
vector instruction set processor, byte or word 
oriented processor, or just a specific processor 
model if several are multiprocessed in the same 
configuration. 


The data construct, TYPE in Figure 2 is de- 
fined to accommodate this capability. In addition, 
each processor must have its own type identifi- 
cation available to the eligibility determining 
logic in the interface registers. 


System Controller Hardware Organization 


The System Controller is the device that is 
designed to contain the fixed data constructs for 
each procedure, performs the logic to determine 
procedure eligibility and system status updates, 
and assigns activities to processors. The device 
described in this section is a specific design 
based on the architecture model which has just 
been described. This design is applicable to 
either single or multiple processor systems. 
Figure 4 is a functional block diagram of this 
device. The content and function of the major 
blocks in the diagram are described below: 
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FIGURE 3: 


transferred to the processor when the procedure is 
activated specifies the initial program counter 
value to be used. Data accesses by the program 
must be implemented with displacements relative 
to pointer packets whose starting addresses are 
indicated by either the READ.or WRITE descriptor 
register value. This displacement specifies a 
particular descriptor value which in turn points 
to the data item being referenced. This scheme 
accommodates unique arguments. to re-enterable 
programs as well as providing a basis for con- 


SYSTEM 
STATUS 
UPDATE 
LOGIC 


PROCEDURE 
ELIGIBILITY 

DETERMINATION 
LOGIC 


FIGURE 4: 


FUNCTIONAL BLOCK DIAGRAM OF THE 
SYSTEM CONTROLLER 
Processor/System Controller Interface


The interface block contains all data and 
control registers accessible to the processors. 
The structure and use of these registers are as 
follows: 


STATUS  is a 3-bit read/write register whose bits
        are labeled P/P, B, and X, and which contain
        the following synchronization, protocol, and
        mode request information.

  P/P   is a 1-bit binary semaphore used to prevent
        multiple processors from accessing the
        System Controller interface registers
        simultaneously. The P/P semaphore is set
        when a processor is accessing the System
        Controller and is reset when no processor
        is currently accessing the System Controller.

  B     is used to prevent the processors from
        accessing the System Controller while it
        is busy servicing a request. When a
        processor makes a request to the System
        Controller, it waits until B is reset, sets
        X to the appropriate value, and sets B
        true. This activates the System Controller,
        which resets B when the request has
        been serviced.

  X     is used to notify the System Controller
        of the type of service being requested.
        X is set (true) by the processors when
        the service requested is the result of a
        procedure exiting and is reset when
        an activity is requested. X is only
        required in multiple processor
        implementations.

TYPE    is a register used to contain the processor
        type identification. The System
        Controller uses this register to determine
        the next eligible procedure whose
        identification is to be loaded into
        INDEX. TYPE contains the processor
        category appropriate to the processor making
        the request. The System Controller
        returns the index of the next eligible
        procedure whose type matches the value
        in the TYPE register.

INDEX   is a register used to contain the
        identification of either the assigned procedure
        or the procedure currently being exited.
        In fulfillment of a processor activity
        request, the System Controller loads
        INDEX with the index of the next eligible
        procedure whose type matches the value
        contained in the TYPE register, or loads
        INDEX with a 0 if no procedures of
        the appropriate processor type are
        eligible. When a procedure exits, the
        System Controller assumes INDEX contains
        the associated procedure index.

EXECUTE contains the entry point of the procedure
        whose index is contained in INDEX. (Refer
        to Figure 3.) EXECUTE is loaded by the
        System Controller as the result of the
        activity request. EXECUTE is unused
        when an exit is requested.

READ    contains an indirect pointer to the
        global data item(s) accessible to the
        associated procedure in a read capacity. (Refer
        to Figure 3.) READ is loaded by the
        System Controller as the result of the
        activity request. READ is unused when an
        exit is requested.

WRITE   contains an indirect pointer to the global
        data item(s) accessible to the associated
        procedure in a write capacity. (Refer to
        Figure 3.) WRITE is loaded by the System
        Controller as the result of the activity
        request. WRITE is unused when an exit
        is requested.

V       contains the variable status update
        vector loaded by the processors upon exit
        from a procedure. This vector allows a
        procedure to return variable data
        condition status to the system status vector.
        Notice that this allows the task to modify
        only selected data elements, since any
        attempt to modify unauthorized data will
        be masked out by the T and F vectors,
        stored internally to the System Controller.


By allowing the processors to access only the 
data and status registers defined above, all 
system control logic is localized to the System 
Controller. This also prevents the processors 
from accessing unauthorized programs, data, or 
control information, providing a natural basis 
for implementing secure systems. The security 
aspects of Transition Machines are discussed in 
reference [15]. 
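The register set and the B/X handshake described above can be sketched in software. This is only an illustrative model, not the hardware design: the class name, the function names, the use of a Python lock for the P/P semaphore, and the `toy_controller` stand-in are all assumptions introduced here, and the controller is called synchronously instead of running concurrently.

```python
from dataclasses import dataclass
from threading import Lock
from typing import Callable, Optional, Tuple


@dataclass
class InterfaceRegisters:
    """Registers visible to the processors (names follow the text)."""
    B: bool = False    # System Controller busy indicator
    X: bool = False    # True = exit request, False = activity request
    TYPE: int = 0      # type of the requesting processor
    INDEX: int = 0     # procedure index; 0 means "nothing eligible"
    EXECUTE: int = 0   # entry point of the assigned procedure
    READ: int = 0      # pointer to readable data descriptors
    WRITE: int = 0     # pointer to writable data descriptors
    V: int = 0         # variable status update vector (bit mask)


Service = Callable[[InterfaceRegisters], None]


def request_activity(regs: InterfaceRegisters, pp: Lock, service: Service,
                     proc_type: int) -> Optional[Tuple[int, int, int, int]]:
    """Ask for work; returns (INDEX, EXECUTE, READ, WRITE) or None."""
    with pp:                    # P/P semaphore: one processor at a time
        regs.TYPE = proc_type
        regs.X = False          # activity (entry) request
        regs.B = True           # activate the System Controller
        service(regs)           # assumed to reset B when finished
        if regs.B:
            raise RuntimeError("System Controller did not reset B")
        if regs.INDEX == 0:
            return None
        return regs.INDEX, regs.EXECUTE, regs.READ, regs.WRITE


def request_exit(regs: InterfaceRegisters, pp: Lock, service: Service,
                 index: int, v: int) -> None:
    """Report that procedure `index` has exited with status vector `v`."""
    with pp:
        regs.INDEX = index
        regs.V = v
        regs.X = True           # exit request
        regs.B = True
        service(regs)


def toy_controller(regs: InterfaceRegisters) -> None:
    """Hypothetical stand-in: offers procedure 3 to type-0 processors."""
    if not regs.X:
        if regs.TYPE == 0:
            regs.INDEX, regs.EXECUTE = 3, 0x100
            regs.READ, regs.WRITE = 0x200, 0x300
        else:
            regs.INDEX = 0
    regs.B = False              # request serviced
```

In this sketch a processor never touches anything but the register set, which mirrors the localization of control logic that the text describes.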


Data Constructs 


The data block is comprised of memory modules 
which contain the data structures required to 
control the system transitions, as shown in 
Figure 2. These include the EXECUTE, READ, WRITE, 
and TYPE arrays and the T,F,A, and R matrices as 
defined previously. The data block receives one 
input, the data select bus which addresses each 
of the EXECUTE, READ, WRITE, TYPE, T,F,R, and A 
constructs concurrently, causing the element 
indexed in each of these memories to be output on 
its associated data bus. 


It is assumed that there is a load capability
which will allow the programmer to change the
content of these memory modules. The content must
necessarily change during program development, or
in real time in large systems where the matrices
will be overlaid dynamically during the execution
of the system (analogous to the overlaying of task
control blocks by a conventional operating system).
For dedicated special purpose applications,
however, these constructs could be fixed and put into
read-only memory. The load procedures are
system-dependent and are therefore not a subject of this
paper. An implementation applicable to large
general-purpose systems is discussed further on
and is the subject of continuing research. The
sizing of these memory modules is also addressed
further on in this paper.


Fixed Transition Logic 


The fixed transition logic block contains
the combinational logic necessary to update the
system status vector and to determine procedure
eligibility. This block requires the T, F, V, and
A vectors for the procedure currently being
assigned or exited in order to generate the new
system status vector, S. The logic expressions
for the new S vector generated as the result of
either an activation or a procedure exit request
are shown in Figure 5. This diagram uses the
convention defined in Figure 6 of using a "/" on
a line to indicate multiple lines treated
identically.


FIGURE 5: SYSTEM STATUS UPDATE LOGIC
[diagram labels: X, S, A, T, F, V, MULTIPLEXER]


FIGURE 6: TREATMENT OF MULTIPLE LINE GATE INPUTS


The combinational logic also combines the
current S vector with successive rows from the R
matrix and compares the successive elements of
the type array with the TYPE register to
determine the eligibility of each procedure. A single
output bit, E_i, is provided as an input to the
synchronization and control logic. The logic to
obtain E_i is shown in Figure 7.


FIGURE 7: PROCEDURE ELIGIBILITY DETERMINATION
LOGIC


The System Controller design described here 
assumes the eligibility vector (E) is computed 
one element at a time. This need not be the case. 
An associative memory can be used to generate the 
entire E vector in parallel which will result in 
much faster transition speeds but requires more 
hardware support. 


Synchronization and Control Logic 


The synchronization and control logic
synchronizes the operation of all the components of
the System Controller. The control logic operates
as shown in Figure 8. The System Controller waits
in an idle state until its execution is initiated
by a processor request (i.e., B set). If X is
set, the System Controller initiates a system
status update; if X is not set, a procedure
eligibility determination and assignment is
initiated.


FIGURE 8:


When a procedure exit request is initiated, 
the non-zero procedure index provided by the proc- 
essor is used as the address selection value on 
the data select bus. This in turn causes the ap- 
propriate array elements to be output on each of 
the memory data buses. The fixed transition logic 
then updates the system status vector to effect 
the update implied by the procedure exit. The B 
indicator in the STATUS register is then reset to 
indicate the request has been serviced. 


The procedure activation request causes
successive rows of the R matrix and the TYPE array to
be output on their respective data buses. As each
row is output, the fixed transition logic generates
the next element of E. INDEX is loaded with either
the procedure index for the first eligible
procedure or with a zero signifying that no procedure
is currently eligible. If INDEX is non-zero, the
EXECUTE, READ, and WRITE pointers associated with
the indexed procedure are transferred to their
respective interface registers. The fixed
transition logic then generates the new S vector
appropriate to the protection of the assigned
procedure. At this point, a complete entry transition
has been effected. The System Controller busy
indicator, B, is then reset to allow the processors
access to the interface registers again.
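The sequential scan described above can be sketched as a short routine. The exact form of the eligibility expression depends on the encoding of the R matrix defined earlier in the paper; this sketch assumes that a procedure is eligible when every condition marked relevant (1) in its R row is currently true in S and its type matches the requesting processor. The function name and the list-based representation are illustrative assumptions.

```python
from typing import Sequence


def first_eligible(S: Sequence[int],
                   R: Sequence[Sequence[int]],
                   types: Sequence[int],
                   proc_type: int) -> int:
    """Return the 1-based index of the first eligible procedure, or 0.

    S         current system status vector (one 0/1 entry per condition)
    R         relevance matrix, one row per procedure (assumed encoding:
              1 means the condition must be true for eligibility)
    types     TYPE array, one processor type per procedure
    proc_type contents of the TYPE interface register
    """
    for i, (row, proc_t) in enumerate(zip(R, types), start=1):
        # Every relevant condition must be set; irrelevant ones are ignored.
        conditions_ok = all(s or not r for s, r in zip(S, row))
        if conditions_ok and proc_t == proc_type:
            return i      # value loaded into INDEX
    return 0              # INDEX = 0: nothing eligible for this type


if __name__ == "__main__":
    S = [1, 0, 1]
    R = [[0, 1, 0],       # needs condition 2, which is false
         [1, 0, 1],       # needs conditions 1 and 3, both true
         [1, 0, 0]]       # needs condition 1, which is true
    types = [0, 1, 0]
    print(first_eligible(S, R, types, proc_type=0))
```

Returning 0 for "nothing eligible" follows the INDEX convention in the text, which is why procedure indices are 1-based here.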


The detailed processor logic and the System
Controller interface and internal logic are provided
in the syntactical expressions of Figures 9 and 10.
These figures have didactic value as indicative of
a source language structure applicable to
application programs to be run on Transition Machines.


Performance Characteristics 


A throughput performance model was developed
which predicts the throughput capabilities of
Parallel Transition Machines [13]. This model was
made general enough to include multiprocessors with
software executive control mechanisms. The three
major contributions to system overhead that were
examined are memory contention, the overhead
associated with the central control mechanism, and
the control mechanism lockout experienced while
waiting for a request to be serviced. Memory contention
contributions were assumed to arise even from
processors which are determining their next application
program assignment (i.e., currently executing the
executive program in the case of a conventional
system or waiting for the System Controller to
service the outstanding procedure entry request
in Transition Machines). Since Parallel Transition
Machines can be designed to be exempt from this
contribution to memory contention, measured
performance should be better than predicted by the
model in this regard.


The significant performance parameter was
shown to be ρ = A/φ, where φ is the characteristic
System Controller overhead per application
procedure (eligibility determination and system status
update), and A is the characteristic application
procedure execution time requirement. The value
of ρ determines the number of processors that
can be effectively combined in a tightly coupled
mode of operation as described in the reference.


When Exit false (Entry)
   When test and set of P/P semaphore true
      When B false (System Controller not busy)
         Begin: Assignment Processing
            Store TYPE
            Store X false (entry)
            Set B true (activate System Controller)
            When B false (System Controller not busy)
               If INDEX ≠ 0 then
                  Load INDEX
                  Load READ
                  Load WRITE
                  Load EXECUTE
                  Clear P/P semaphore
                  Transfer Control to EXECUTE
                  Set Exit true
               Else
                  Clear P/P semaphore
         End: Assignment Processing


When Exit true
   When test and set of P/P semaphore true
      When B false (System Controller not busy)
         Begin: Exit Processing
            Store INDEX
            Store V
            Set X true (exit)
            Set B true (activate System Controller)
            Set Exit false
            Clear P/P semaphore
         End: Exit Processing


FIGURE 9: PROCESSOR CONTROL LOGIC


When B true
   Begin: System Controller Logic
      If X true (exit) then
         i = INDEX
         S_new = ((S_old ∨ A_i) ∧ T_i) ∨ (F_i ∧ V) ∨ (T_i ∧ F_i)
      Else (entry)
         Clear i
         While E_i = 0 and i < I
            Increment i
            E_i = [AND over j = 1..J of (s_j ∨ r_ij)] ∧ (TYPE ⊙ TYPE_i)
         If E_i = 0 then
            INDEX = 0
         Else
            INDEX = i
            READ = READ_i
            WRITE = WRITE_i
            EXECUTE = EXECUTE_i
            S_new = (S_old ∧ A_i)
      Set B false
   End: System Controller Logic


FIGURE 10: SYSTEM CONTROLLER LOGIC


The overhead, φ, is very dependent upon the
component technology used in the development of
the System Controller; A is dependent upon the
speed of the individual processors. Current
component technology will support φ < 1 microsecond.
With microprocessors, A > 100 microseconds is very
conservative. This yields ρ > 100, which indicates
according to the model that on the order of 100
microprocessors could be combined in a tightly
coupled mode of operation controlled by a single
System Controller with a proportionate throughput
capability.
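As a rough check of the arithmetic above, the ratio of application procedure execution time to System Controller overhead can be computed directly. The function name is an illustrative assumption, and the sample values are simply the 100 microsecond and 1 microsecond bounds quoted in the text.

```python
def overhead_ratio(exec_time_us: float, overhead_us: float) -> float:
    """Ratio of procedure execution time to controller overhead."""
    if overhead_us <= 0:
        raise ValueError("overhead must be positive")
    return exec_time_us / overhead_us


if __name__ == "__main__":
    # Bounds quoted in the text: A > 100 us, overhead < 1 us.
    # Using the bounds themselves gives the limiting value of the ratio.
    print(overhead_ratio(100.0, 1.0))
```

Any longer procedure or faster controller only increases the ratio, so the bounds give the least favorable case.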


Lockout is the primary reason for the
flattening of the throughput curve as a function of
the number of processors. To avoid this problem
in Parallel Transition Machines, multiple System
Controllers (including separate interface registers)
can be incorporated so that if a processor is locked
out of one System Controller it can attempt to
acquire another, etc. Thus, in a batch type system,
many disjoint computations could be running
contemporaneously across all processors.


A given computation will be characterized 
by some maximum and average numbers of concurrent 
execution paths. (See for example Kuck [10].) 
The average concurrency will determine the proc- 
essor utilization realizable during the execution 
of a given computation. Thus, in general where 
many processors are included in a configuration, 
there is a requirement for concurrency of active 
computations in order to achieve efficient utili- 
zation of processors. This applies particularly 
where a large system is being used for many small 
computations. 


The upper limit on throughput in Parallel 
Transition Machines is thus determined by other 
than system control considerations. Physical 
module interconnection schemes now become the 
limiting factors. High speed buses and processor/ 
memory groupings [12] are likely approaches to 
extending these limits. 


System Controller Memory Sizing


In what has been presented so far, there has 
been the implicit assumption that all of the con- 
structs associated with relevant data conditions 
and programmed procedures for an entire system can 
be accommodated in the data constructs memory 
modules in the System Controller. In a prototype 
system currently in the development stages, these 
constructs are contained in a single module and 
are allocated dimensions of 32 conditions by 32 
procedures. Analytic studies and simulations have 
indicated that as a rule of thumb there are one 
and one-half times as many conditions required to 
control a tightly coupled network of procedures 
as the number of procedures involved. This would 
indicate that optimum memory utilization would be 
more probable for System Controllers characterized 
by such dimensional ratios. 


How large these memories should be made to 
support general applications is a key issue. 
There is no real problem in sizing these arbitrarily 
large, but application systems have a way of out- 
growing single physical modules. Methods have 
therefore been investigated which extend the logi- 
cal capacity of the System Controller in modular 
incremental units to justify the development of a 
standard System Controller applicable across a 
broad spectrum of system sizes. 


The most direct method of extending capacity
is to directly increase the dimensions of the
memory in the System Controller. The design described
in the previous section can be expanded to
accommodate more data conditions (i.e., horizontal
expansion of the arrays) or more procedures (i.e.,
vertical expansion of the arrays) by cascading
multiple sets of the standard component blocks
which are then connected to a common processor/
System Controller interface. This facilitates the
construction of an arbitrarily large System
Controller by the interconnection of many standard
System Controller components, each of fixed size.
This cascading of component System Controllers is
shown in Figure 11 for horizontal expansion. A
similar approach is applicable to vertical
expansion, where a master System Controller is used to
multiplex between vertical segments, each of which
is controlled by its own System Controller. These
methods of modular expansion do not increase worst
case transition times and can be applied in a
structured physical hierarchy.


FIGURE 11: MODULAR HORIZONTAL EXPANSION OF THE
SYSTEM CONTROLLER


Virtual Transition Machines 


Memory size problems are not new to computing,
and the resolution of former problems encountered
with insufficient main memory can be applied
directly to the System Controller memories. The
concept of virtual memory and commensurable virtual
machines is particularly germane to Transition
Machines. The segmentation and paging of the data
constructs required by Transition Machines can be
an integral part of a hierarchical application
program design approach. The approach encompasses
the design and implementation of systems as a
complete transition system at each level in a
hierarchy. This approach incorporates the capability
of associating an indentured R matrix with a row
in a higher level matrix. Conditions appropriate
to each matrix level include only those which are
relevant to the procedures (or immediate further
indentured matrices) at this level. This top-down
recursion can proceed in the extreme all the way
down to where the procedures become the instruction
set of processors or even indivisible operators.
This instruction set can then be restricted to
exclude branching type instructions. In fact, even
going up one level, the Parallel Transition Machines
can be used to implement a completely general
system in conventional processors without requiring a
branch type (GOTO) instruction in the application
domain of the processors. The implications for
software productivity are discussed in reference [1].


To implement the logical extension of the data 
constructs by partitioning the R, T, F and A mat- 
rices, additional data constructs and algorithms 
will be required to effect the dynamic real-time 
loading and overlaying of procedures. The required 
constructs are the following: 


1. System Controller identification 
2. Relevance matrix identification 


3. Indication of whether a procedure or an
indentured matrix is associated with each row in
an R matrix


4. Controller active indication for each 
System Controller 


5. Logical parent identification (relevance 
matrix identification of parent) 


6. Physical parent identification (System 
Controller identification of parent) 


These data constructs are felt to be suffici- 
ent to implement a global operating system which 
results in a virtual implementation of the total 
system relevance matrix. Such an operating system 
is the subject of current and anticipated future 
research. 


Conclusions 


It has been demonstrated that even though there
are no completely general parallel computation
architectures commercially available, such computers
are nonetheless realizable. Parallel Transition
Machines which meet these specifications are
defined to a level where credibility is established.
A prototype of such a machine is currently in the
developmental stages at the Boeing Aerospace
Company. A design has been presented in this paper
which is flexible enough to meet a wide range
of implementation variations. Such machines appear
to have considerable advantages over current machine
architectures in several areas. These areas include
multiprocessing throughput, software productivity,
and ADP security.


It is left to the future to develop the com- 
pilers, linkage editors and overlaying operating 
systems appropriate to the application of such 
computers. 
PERFORMANCE EVALUATION AND RESOURCE OPTIMIZATION
OF MULTIPLE SIMD COMPUTER ORGANIZATIONS

W. Lafayette, Indiana


Abstract -- Multiprocessor computer systems
which can be used to execute multiple
SIMD vector jobs are modeled and analyzed in this
paper. Such a parallel computer organization
usually has multiple Control Units (CU's) sharing a
pool of dynamically allocated Processing Elements
(PE's). We characterize the MSIMD machine as a
finite-storage multi-server M/M/K/ℓ queueing
family, where the number of active servers, K, is an
upper-bounded random variable and ℓ is the maximum
queue length. Analytic results using conditional
probabilities are obtained under equilibrium
conditions. These results can be readily applied to
evaluate the performance of multiple SIMD
machines. The system performance is measured by
the resource utilization factors of the CU's and
PE's and by the average job response time. Given a
fixed number of CU's and a prespecified workload,
systematic procedures are given to determine the
optimal size of the resource pool of PE's and the
sufficient queue capacity for MSIMD operations.


I. INTRODUCTION 

This paper presents the results of an analytic
study to evaluate the performance of
shared-resource Multiple Single Instruction stream and
Multiple Data streams (MSIMD) machine
organizations. An MSIMD machine is composed of more than
one Control Unit (CU) sharing a finite pool of
Processing Elements (PE's) through an
interconnection switching network. To facilitate
generalized discussions, we assume that the system
has m CU's and r PE's interconnected through a
full crossbar switching network. Furthermore,
sufficiently large local memories and I/O facilities
are attached to each CU and each PE in the
system (Fig. 1).


Each CU must be allocated a subset of PE's
for the execution of a given SIMD job
(a vector process). The word "job" will be used
here to denote some identifiable piece of work
that is logically independent of all other jobs.
Thus, the work on one job is never dependent on
the work of another; the only way the jobs
interact with each other is by their independent
needs for the same resources. Several parallel
machines capable of executing multiple SIMD jobs
have been proposed in the past. The original
ILLIAC IV design was for four CU's [1]. Senzig [15]
studied an array processor with multiple CU's.

*This research was supported in part by NSF Grant 
MCS-78-18906. 
Radoy and Lipovski [13] proposed multiple array
processing with switched multiple instructions and
multiple data streams. In the MAP system, Nutt
[10]-[12] presented a series of studies on MSIMD
machines including operating system strategies and
performance evaluations. Hwang and Ni proposed a
dynamically allocated multiple SIMD/SISD
architecture [5]. Recently, a reconfigurable
multiprocessor system, called PM4, was proposed by a
research team at Purdue University for pattern
recognition and image processing applications [2].
PM4 is designed for performing MIMD and MSIMD
operations through a dynamic reconfiguration approach.


In this study, we will not partition the set of
PE's into fixed subsets. Instead, we assume a
completely accessible crossbar switch which can
establish all the possible interconnections
between CU's and PE's. This is necessary for
dynamic partitioning of the PE set as required in
our analysis. The subset of PE's allocated to a
given CU may vary in size according to the job
requirement. The time required to allocate PE's to
the designated CU is considered part of the CU
service time. Various types of queueing models have
been proposed to study the performance of computer
systems by a number of authors [3,4,14]. This
paper presents the first attempt to model MSIMD
machines in an analytic fashion. Numerical plots
of the analytic results are presented with
commentaries and comparisons.


An analytical solution is obtained by modeling
the MSIMD machines as an M/M/K/ℓ queueing family,
where the first and second M represent, respectively,
the exponential interarrival time and the
exponential service time. The number of active
servers (CU's), K, is a random variable and ℓ is
the maximum queue length. The MSIMD system
performance is measured by the utilization of CU's
and of PE's and by the average response time for
jobs with arbitrarily distributed vector sizes.
Resource optimization of the PE-pool size and a
method to determine the sufficient buffer size for
a specified workload distribution are also presented.


II. THE QUEUEING MODEL FOR MSIMD MACHINES 


The MSIMD machine organizations illustrated in
Fig. 1 can be modeled by the queueing network in
Fig. 2. Specified below are the general
characteristics of the model for MSIMD machine
organizations and the queueing disciplines to be used
in our study.


Fig. 1 (a) SIMD machine organization.
       (b) Multiple SIMD (MSIMD) machine organization.


Fig. 2 The MSIMD model with m Control Units (CU's)
       and a shared-resource pool of r Processing
       Elements (PE's).
       [diagram labels: λ, Poisson arrival rate of
       input processes; μ, exponential service rate of
       each Control Unit; the shared-resource pool of
       r PE's; maximum queue length ℓ]


1. The Servers (CU's).

There are m identical servers, called the
Control Units (CU's). The service time of each CU
is assumed to be exponentially distributed with
service rate μ. Each CU can supervise the
execution of one SIMD job at a time, and at most m SIMD
jobs can be executed simultaneously in the system.
All the servers are uniprogrammed and operate
independently, except that they share the same pool
of PE's under the control of an executive system
which may reside in a separate manager processor.


2. The Shared Resource Pool (PE's).

There are r identical Processing Elements
(PE's) shared by all the CU's. Each PE can be
allocated to only one CU at a time through the m
x r crossbar switching network. Note that m
additional crosspoints are needed for each additional
PE. We call this the switching overhead per
PE. All the PE's are independent and
uniprogrammed to accept one assignment at a time.


3. The Job Arrival Process.

The arrival of the SIMD jobs follows a Poisson
distribution. We shall write

$$ P\{n \text{ arrivals in } t\} = \frac{e^{-\lambda t} (\lambda t)^{n}}{n!} \tag{1} $$

as the probability that there will be n SIMD jobs
arriving in time period t, with the mean arrival
rate λ.
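To make the Poisson arrival assumption concrete, the probability of n job arrivals in a period t can be evaluated numerically. This is a minimal sketch; the function name and the sample rate and period are illustrative assumptions, not values from the paper.

```python
import math


def poisson_pmf(n: int, rate: float, t: float) -> float:
    """Probability of exactly n arrivals in time t at mean arrival rate `rate`."""
    if n < 0:
        raise ValueError("n must be nonnegative")
    mean = rate * t
    return math.exp(-mean) * mean ** n / math.factorial(n)


if __name__ == "__main__":
    rate, t = 2.0, 1.5   # hypothetical: 2 jobs per unit time, 1.5 time units
    for n in range(6):
        print(n, round(poisson_pmf(n, rate, t), 4))
```

Summing the probabilities over all n gives 1, which is a quick sanity check on any chosen rate and period.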
4. The SIMD Job Characteristics.

Let {X_i, i = 1, 2, ...} be a sequence of positive,
integer-valued, independent random variables with
a common distribution F. The PE demand, X_i,
represents the maximum number of PE's required for
the i-th SIMD job. The probability distribution F
of X_i is called the PE-demand distribution. The
values of X_i are confined within the range [1, r],
where r corresponds to the maximum number of
available PE's.


The PE-demand distribution F does not follow a
specific pattern in reality. Nutt [12] has
assumed a normal distribution for F. Our analytic
results are obtained for an arbitrary
distribution F. Analytic results are numerically
plotted with respect to a "truncated" normal
distribution for F. For a real SIMD job, the vector size
may be greater than the maximum pool size r. Under
such circumstances, the long vector instruction
must be partitioned into several subvector
instructions, chained together to fit the SIMD job
requirement. The average PE demand is defined by

$$ n = E[X_i] = \sum_{k=1}^{r} k \cdot P\{X_i = k\} \tag{2} $$
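The average PE demand of equation (2) can be illustrated for a discretized, truncated normal PE-demand distribution on the range 1 to r. How the paper truncates the normal distribution is not specified here, so the renormalization below, the function names, and the sample parameters are assumptions made for illustration.

```python
import math
from typing import List


def truncated_normal_pmf(r: int, mean: float, std: float) -> List[float]:
    """PE-demand probabilities for k = 1..r from a normal shape,
    renormalized so that they sum to 1 (assumed truncation scheme)."""
    weights = [math.exp(-0.5 * ((k - mean) / std) ** 2) for k in range(1, r + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def average_pe_demand(pmf: List[float]) -> float:
    """Expected PE demand: sum over k = 1..r of k * P{X = k}."""
    return sum(k * p for k, p in enumerate(pmf, start=1))


if __name__ == "__main__":
    pmf = truncated_normal_pmf(r=9, mean=5.0, std=2.0)   # hypothetical values
    print(round(average_pe_demand(pmf), 6))
```

When the normal shape is centered in the middle of the range, as in this example, truncation is symmetric and the average demand equals the chosen mean.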
5. The Maximum Queue Capacity (ℓ).

For an infinite queue, every arriving SIMD job
is allowed to wait in the queue until service can
be provided. In practical multiprocessor design,
this assumption is unrealistic due to the finite
capacity of the buffer area used. We shall assume
"ℓ" to be the maximum number of jobs allowed in the
queue. When the queue is full, the new jobs are
turned away. With a finite-storage queue, the
Actual Job Arrival Rate, λ_a, is less than the Real
Job Arrival Rate, λ, corresponding to an infinite
queue.


6. The Queueing Discipline.

The queueing discipline specifies the order in
which the multiple SIMD jobs are to be executed by
multiple CU's. Let us first define the states of
a CU. A job is assigned to a CU (that CU enters
a busy state) provided a sufficient number of PE's
are available to form the task force. Until the
task force can be adequately formed, the
unassigned CU's are still available for assignment and
remain in the waiting state.


The First-Come-First-Serve (FCFS) discipline
decides the order of service strictly by the order
of job arrivals. It is the simplest scheduling
policy to implement. However, it has the
disadvantage that an SIMD job which demands a
large number of PE's from the pool may block
the front of the queue while waiting for a sufficient
task force, even though the remaining available PE's
in the pool may satisfy other small jobs in the queue.

The Least-PE-Demand-Serve-First (LDSF)
discipline can be used to alleviate the drawbacks of the
FCFS discipline. The insertion of successive
jobs must keep the queue ordered by PE demand,
such that less-PE-demanding jobs are always
placed ahead of the more-PE-demanding jobs.
Once a job is assigned to a CU, it cannot be
ejected until the service is completed (that is,
the discipline is nonpreemptive).
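The difference between the two disciplines can be seen in a small sketch that decides which waiting job, if any, may start given the number of free PE's. The function names and the representation of a job by its PE demand alone are simplifying assumptions; CU availability, service times, and tie-breaking among equal demands are not modeled beyond keeping arrival order.

```python
import bisect
from collections import deque
from typing import Deque, List, Optional


def fcfs_next(queue: Deque[int], free_pes: int) -> Optional[int]:
    """FCFS: only the job at the head of the queue may start."""
    if queue and queue[0] <= free_pes:
        return queue.popleft()
    return None   # head job blocks everything behind it


def ldsf_insert(queue: List[int], demand: int) -> None:
    """LDSF: keep the queue sorted so the smallest demand is at the head.
    insort_right places equal demands after earlier arrivals."""
    bisect.insort_right(queue, demand)


def ldsf_next(queue: List[int], free_pes: int) -> Optional[int]:
    """LDSF: start the least-demanding job if it fits."""
    if queue and queue[0] <= free_pes:
        return queue.pop(0)
    return None


if __name__ == "__main__":
    arrivals = [6, 2, 3]          # PE demands in arrival order (hypothetical)
    free = 4                      # PE's currently available

    fcfs_queue = deque(arrivals)
    print("FCFS starts:", fcfs_next(fcfs_queue, free))

    ldsf_queue: List[int] = []
    for demand in arrivals:
        ldsf_insert(ldsf_queue, demand)
    print("LDSF starts:", ldsf_next(ldsf_queue, free))
```

With these sample demands the FCFS head job needs more PE's than are free and nothing starts, whereas LDSF can start the smallest waiting job.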


ITI. SYSTEM PERFORMANCE MEASURES 


Before we evaluate the performance of the MSIMD
machine organization, we have to define the states
of the system. The random number, $N(t)$, of SIMD
jobs residing in the system at time t can be
divided into two parts

$$N(t) = N_q(t) + N_s(t) \qquad (3)$$

where $N_q(t)$ and $N_s(t)$ are nonnegative, integer-valued
random variables representing at time t,
respectively, the number of jobs waiting in the
queue and the number of jobs which are currently
under execution. The system is in state $E_{ij}$ at
time t, when $N_q(t) = i$ and $N_s(t) = j$ where
$0 \le i \le \ell$, $0 \le j \le m$. The probability that the
system is in state $E_{ij}$ at time t is denoted by
$P_{ij}(t)$. Note that $P_{ij}(t) = 0$ for $i > \ell$ or $j > m$.


After the system becomes stabilized, i.e. after
being operational for a long time period, the
influences of the initial conditions and transient
responses will be damped out and the number of
jobs in the system becomes time-independent. The
system then enters the so-called "steady
state". The limiting probability of $P_{ij}(t)$ is
simply denoted by

$$P_{ij} = \lim_{t \to \infty} P_{ij}(t) \qquad (4)$$

Furthermore, in the steady state, we have the
time-invariant random variables

$$N = \lim_{t \to \infty} N(t) = \lim_{t \to \infty} \bigl(N_q(t) + N_s(t)\bigr) = N_q + N_s.$$
Four system measures are defined below to
evaluate the performance of MSIMD machines. The
internal efficiency of an MSIMD system is primarily
determined by its utilization factors of the
CU's and PE's. The external performance of the
system is indicated by the mean response time of
user's jobs. All these measures are evaluated
under the steady-state condition.
1. The CU utilization ($\rho_{CU}$)

The ratio of the expected number of CU's in
busy state to the total number of CU's in the
system defines the CU utilization factor

$$\rho_{CU} = \frac{1}{m} \sum_{i=0}^{\ell} \sum_{j=0}^{m} j \, P_{ij} \qquad (5)$$
2. The PE utilization ($\rho_{PE}$)

Not all the PE's allocated to a CU will be busy
all the time. For example, when a control or
scalar instruction is executed in the CU directly,
none of the allocated PE's will be called. In a
masked vector operation, a portion of the allocated
PE's may be masked out. In both cases, the disabled
PE's may be left idling even though they have been
allocated to the CU. For simplicity in analysis,
we shall consider a PE to be utilized once it is
allocated to a CU, regardless of whether the PE is
actually executing a broadcasted instruction from
the CU or is left idling during the instruction
cycle.

$$\rho_{PE} = \frac{1}{r} \sum_{i=0}^{\ell} \sum_{j=0}^{m} n_j \, P_{ij} \qquad (6)$$

where $n_j$ is the average number of PE's required
for j SIMD jobs.
3. The Total System Utilization ($\rho$)

The overall system resource utilization can be
measured as a weighted sum of the two utilization
factors $\rho_{CU}$ and $\rho_{PE}$ divided by their maximum
values.

$$\rho = (n \cdot m \cdot \rho_{CU} + r \cdot \rho_{PE}) / (n \cdot m + r) \qquad (7)$$

, where the weighting factors m and r correspond
to the multiplicity of CU's and PE's in the system
and n is the average PE-demand in an arbitrary
SIMD job. The reason that the CU's are weighted
higher is due to their higher hardware complexity
and software capability. Whenever a CU is busy,
an average of n PE's will be drafted for doing the
job.


4. The Average Job Response Time (W)

By Little's formula [8], we can define the
Job Response Time, W, as the ratio of two
parameters

$$W = \frac{L}{\lambda_a} = \frac{\sum_{i=0}^{\ell} \sum_{j=0}^{m} (i + j)\, P_{ij}}{\lambda \left(1 - \sum_{k=1}^{m} P_{\ell k}\right)} \qquad (8)$$

, where L is the average number of jobs in the
system including both jobs waiting in the queue
and those currently under execution and $\lambda_a$ is the
actual job arrival rate. $P_{\ell k}$ is the probability
that the queue is full (no additional arriving
jobs are allowed to enter a full queue) when k
SIMD jobs are being executed in the system. These
performance measures, $\rho_{CU}$, $\rho_{PE}$, $\rho$ and W, will be
plotted in section V with numerical examples.


IV. THE ANALYSIS AND PROBABILISTIC RESULTS


Let $S_k$ be a random variable representing the
total number of PE's required for k independent
SIMD jobs. We can write $S_k$ in terms of a sequence
of random variables $X_i$,

$$S_k = \sum_{i=1}^{k} X_i \quad \text{where } 1 \le k \le m \qquad (9)$$

Let $F_k$ be the probability distribution of $S_k$. The
expected value of $S_k$ is denoted by $n_k = E[S_k]$. $F_k$
can be written as the k-fold convolution of F with
respect to itself, where * is the convolution
operator in the r-domain. We shall write
$F_k(r) = P\{S_k \le r\}$, which equals $F_k$ evaluated at
point r.

$$F_k = \underbrace{F * F * \cdots * F}_{k \text{ times}} \qquad (10)$$

Since all random variables $X_i$ are independent
with identical distribution, the expected value $n_k$
can be written as $n_k = E[S_k] = k \cdot E[X_i] = k \cdot n$.

Given the PE-pool size r, let K(r) be the random
variable representing the number of active
servers (CU's) in the system. Having k active
servers means the system can have at most k CU's
in busy state. The new jobs will wait in the queue
only when all the k CU's are busy.

The probability of having k active servers is
denoted by $\alpha_k = P\{K(r) = k\}$. The expected number
of active servers is denoted by $\alpha = E[K(r)]$. The
number K(r) is greater than or equal to k, if and
only if k SIMD jobs can be executed simultaneously;
that is, $K(r) \ge k \Leftrightarrow S_k \le r$ for all $1 \le k \le m$.

Lemma 1:

With a given pool of r PE's we can write

$$\alpha_k = \begin{cases} F_k(r) - F_{k+1}(r), & 1 \le k \le m-1 \\ F_m(r), & k = m \\ 0, & k > m \end{cases} \qquad (11)$$

Proof:

$$\alpha_k = P\{K(r) = k\} = P\{S_k \le r,\ S_{k+1} > r\} = P\{S_k \le r\} - P\{S_{k+1} \le r\} = F_k(r) - F_{k+1}(r), \quad \text{for } 1 \le k \le m-1.$$

$$\alpha_m = P\{K(r) = m\} = P\{S_m \le r\} = F_m(r), \quad \text{for } k = m.$$

$\alpha_k = 0$, since there are only m CU's, for $k > m$.

$$\sum_{k=1}^{\infty} \alpha_k = \sum_{k=1}^{m-1} \bigl[F_k(r) - F_{k+1}(r)\bigr] + F_m(r) = F_1(r) = 1$$

Q.E.D.
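Lemma 2:

The expected number $\alpha$ of active servers can be
evaluated by

$$\alpha = \sum_{n=1}^{m} F_n(r) \qquad (12)$$

Proof:

$$\alpha = E[K(r)] = \sum_{k=1}^{m} k \cdot \alpha_k = \alpha_1 + 2\alpha_2 + \cdots + (m-1)\alpha_{m-1} + m \cdot \alpha_m$$

$$= (\alpha_1 + \alpha_2 + \cdots + \alpha_m) + (\alpha_2 + \alpha_3 + \cdots + \alpha_m) + \cdots + (\alpha_{m-1} + \alpha_m) + \alpha_m$$

$$= \sum_{n=1}^{m} P\{K(r) \ge n\} = \sum_{n=1}^{m} P\{S_n \le r\} = \sum_{n=1}^{m} F_n(r)$$

Q.E.D.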
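Lemmas 1 and 2 can be checked numerically with a short Python sketch. It assumes a discrete PE-demand distribution given as a list `pmf` with `pmf[x] = P{X = x}` for x = 0..r; the function names and the two example distributions are hypothetical, and a direct convolution is used here rather than any transform method.

```python
def convolve(f, g, r):
    """Convolution of two PMFs indexed 0..r, truncated at r.
    Mass above r can be dropped because PE demands are nonnegative,
    so a partial sum that exceeds r never comes back below it."""
    out = [0.0] * (r + 1)
    for i, fi in enumerate(f):
        if fi == 0.0:
            continue
        for j, gj in enumerate(g):
            if i + j > r:
                break
            out[i + j] += fi * gj
    return out

def active_server_distribution(pmf, m, r):
    """Return ([alpha_1, ..., alpha_m], alpha) following Eqs. 11 and 12."""
    F = []                      # F[k-1] holds F_k(r) = P{S_k <= r}
    dist = list(pmf)            # distribution of S_1, restricted to 0..r
    for _ in range(m):
        F.append(sum(dist))
        dist = convolve(dist, pmf, r)
    alpha_k = [F[k] - F[k + 1] for k in range(m - 1)] + [F[m - 1]]
    return alpha_k, sum(F)

# Deterministic demand: every job needs exactly 128 of r = 512 PE's, m = 8.
det_pmf = [0.0] * 513
det_pmf[128] = 1.0
det_alpha_k, det_alpha = active_server_distribution(det_pmf, m=8, r=512)

# Small random demand: X is 1 or 2 with equal probability, r = 3, m = 3.
coin_pmf = [0.0, 0.5, 0.5, 0.0]
coin_alpha_k, coin_alpha = active_server_distribution(coin_pmf, m=3, r=3)
```

In the deterministic case the sketch reduces to four active servers, which agrees with min(m, r/n) for these values.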
For a fixed number of k active servers, the
system can be described by a finite storage
M/M/k/$\ell$ queue. We can view the collection of m
M/M/k/$\ell$ queues for k = 1,2,...,m as an M/M/K/$\ell$
queueing family, where K is a random variable
ranging from 1 to m as defined above. The analytic
results obtained below are averaged over the m
possible M/M/k/$\ell$ queues in the M/M/K/$\ell$ queueing
family. These results can be used only as an
approximate solution to the MSIMD system in a
statistical sense. Simulation results reported in [9]
verify that this approximate solution is indeed
very close from below to what can be obtained from
the simulation study using either discipline.


Using the theorem on total probability, the
limiting probability $P_{ij}$ can be evaluated as

$$P_{ij} = P\{N_q = i, N_s = j\} = \sum_{k=1}^{m} P\{N_q = i, N_s = j \mid K(r) = k\} \cdot \alpha_k \qquad (13)$$

where $P\{N_q = i, N_s = j \mid K(r) = k\}$ is the conditional
probability of the system being in state
$E_{ij}$ given that there are only k active servers.


We define $u = \lambda/\mu$ as the traffic intensity and $Z_k$
as the probability of an empty system with all k
active servers idling and an empty queue, that is
$Z_k = P\{N_q = 0, N_s = 0 \mid K(r) = k\}$.

The feasible state space of an M/M/k/$\ell$ queue
consists of the states $E_{0j}$, $0 \le j \le k$, and
$E_{ik}$, $1 \le i \le \ell$.
$P\{N_q = i, N_s = j \mid K(r) = k\}$ represents the
steady state probability of the system in state
$E_{ij}$. With finite queue length, the system
approaches the steady state for various values of
traffic intensity. Proofs of the following
theorems can be found in [9].
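Theorem 1:

$$P\{N_q = i, N_s = j \mid K(r) = k\} = \begin{cases} \dfrac{u^j}{j!} \cdot Z_k, & i = 0 \text{ and } 0 \le j \le k \\[2ex] \dfrac{u^k}{k!} \left(\dfrac{u}{k}\right)^{i} \cdot Z_k, & j = k \text{ and } 1 \le i \le \ell \\[2ex] 0, & \text{otherwise} \end{cases} \qquad (14)$$

$$\text{where } Z_k = \left[ \sum_{n=0}^{k} \frac{u^n}{n!} + \frac{u^k}{k!} \sum_{n=1}^{\ell} \left(\frac{u}{k}\right)^{n} \right]^{-1} \qquad (15)$$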
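Equations 14 and 15 translate directly into a few lines of Python. The sketch below is illustrative only: the function names and the example values u = 3.0, k = 4, $\ell$ = 20 are assumptions, and the final sum checks that the conditional probabilities of one M/M/k/$\ell$ queue add up to one.

```python
from math import factorial

def empty_probability(u, k, ell):
    """Z_k of Eq. 15: probability that an M/M/k/ell queue is empty."""
    total = sum(u ** n / factorial(n) for n in range(k + 1))
    total += (u ** k / factorial(k)) * sum((u / k) ** n for n in range(1, ell + 1))
    return 1.0 / total

def conditional_state_probability(i, j, u, k, ell):
    """P{N_q = i, N_s = j | K(r) = k} of Eq. 14."""
    z = empty_probability(u, k, ell)
    if i == 0 and 0 <= j <= k:
        return u ** j / factorial(j) * z
    if j == k and 1 <= i <= ell:
        return u ** k / factorial(k) * (u / k) ** i * z
    return 0.0

# Example values chosen for illustration only.
u, k, ell = 3.0, 4, 20
total_probability = sum(
    conditional_state_probability(i, j, u, k, ell)
    for i in range(ell + 1)
    for j in range(k + 1)
)
```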
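Theorem 2.

The limiting probability $P_{ij}$ defined in Eq.4
can be evaluated by

$$P_{ij} = \begin{cases} \displaystyle\sum_{k=1}^{m} Z_k \cdot \alpha_k, & i = 0 \text{ and } j = 0 \\[2ex] \displaystyle\sum_{k=j}^{m} \frac{u^j}{j!} \cdot Z_k \cdot \alpha_k, & i = 0 \text{ and } 1 \le j \le m \\[2ex] \dfrac{u^j}{j!} \left(\dfrac{u}{j}\right)^{i} \cdot Z_j \cdot \alpha_j, & 1 \le i \le \ell \text{ and } 1 \le j \le m \end{cases} \qquad (16)$$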
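Theorem 3.

The CU utilization factor $\rho_{CU}$ and the PE
utilization factor $\rho_{PE}$ can be written as

$$\rho_{CU} = \frac{u}{m} \left[ 1 - \sum_{k=1}^{m} \frac{u^k}{k!} \left(\frac{u}{k}\right)^{\ell} \cdot Z_k \cdot \alpha_k \right] \qquad (17)$$

$$\rho_{PE} = \frac{m \cdot n}{r} \cdot \rho_{CU} \qquad (18)$$

Corollary:

The system utilization $\rho$ can be evaluated by

$$\rho = \frac{2 \cdot m \cdot n \cdot \rho_{CU}}{m \cdot n + r} = 2 \cdot u \cdot n \left[ 1 - \sum_{k=1}^{m} \frac{u^k}{k!} \left(\frac{u}{k}\right)^{\ell} \cdot Z_k \cdot \alpha_k \right] \Big/ (m \cdot n + r) \qquad (19)$$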
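As a rough numerical check of Theorem 3 and its corollary, the following Python sketch evaluates Eqs. 17 to 19 for a given list of active-server probabilities. The function names and the small example (one CU, no waiting room) are assumptions for illustration; the example was picked because its busy probability u/(1+u) is a well-known result for a single-server loss system.

```python
from math import factorial

def empty_probability(u, k, ell):
    """Z_k of Eq. 15."""
    total = sum(u ** n / factorial(n) for n in range(k + 1))
    total += (u ** k / factorial(k)) * sum((u / k) ** n for n in range(1, ell + 1))
    return 1.0 / total

def utilizations(u, alpha_k, ell, n, r):
    """Return (rho_CU, rho_PE, rho) from Eqs. 17, 18 and 19.
    alpha_k[k-1] is the probability of having k active servers."""
    m = len(alpha_k)
    full_queue = sum(
        (u ** k / factorial(k)) * (u / k) ** ell
        * empty_probability(u, k, ell) * alpha_k[k - 1]
        for k in range(1, m + 1)
    )
    rho_cu = (u / m) * (1.0 - full_queue)
    rho_pe = (m * n / r) * rho_cu
    rho = 2.0 * m * n * rho_cu / (m * n + r)
    return rho_cu, rho_pe, rho

# Hypothetical example: m = 1 CU, no queue (ell = 0), u = 1, n = 4, r = 8.
rho_cu, rho_pe, rho = utilizations(u=1.0, alpha_k=[1.0], ell=0, n=4, r=8)

# The weighted sum of Eq. 7 should give the same system utilization.
rho_from_eq7 = (4 * 1 * rho_cu + 8 * rho_pe) / (4 * 1 + 8)
```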
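Lemma 5:

The average queue length $L_q$ can be evaluated by

$$L_q = \sum_{j=1}^{m} \frac{u^j \cdot Z_j \cdot (u/j)}{j! \, (1 - u/j)^2} \left[ 1 - \left(\frac{u}{j}\right)^{\ell} - \ell \left(\frac{u}{j}\right)^{\ell} \left(1 - \frac{u}{j}\right) \right] \cdot \alpha_j \qquad (20)$$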
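The bracketed factor in Eq. 20 is the closed form of the finite sum $\sum_{i=1}^{\ell} i\,x^i$ with $x = u/j$, which arises when the queue lengths are weighted by the probabilities of Eq. 14. The short Python check below compares the two forms; the function names and the test values are assumptions for illustration, and the closed form requires $x \ne 1$.

```python
def weighted_sum_direct(x, ell):
    """Direct evaluation of sum_{i=1}^{ell} i * x**i."""
    return sum(i * x ** i for i in range(1, ell + 1))

def weighted_sum_closed(x, ell):
    """Closed form x * [1 - x**ell - ell * x**ell * (1 - x)] / (1 - x)**2, x != 1."""
    return x * (1.0 - x ** ell - ell * x ** ell * (1.0 - x)) / (1.0 - x) ** 2

direct = weighted_sum_direct(0.5, 3)   # 0.5 + 2 * 0.25 + 3 * 0.125
closed = weighted_sum_closed(0.5, 3)
```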

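The following theorem shows that the average
job response time W equals the sum of the average
job waiting time, $L_q/\lambda_a$, and the average job
service time, $1/\mu$, where $L_q$ is the average queue
length of Lemma 5.

Theorem 4

$$W = \frac{L_q}{\lambda_a} + \frac{1}{\mu} \qquad (21)$$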


V. NUMERICAL EXAMPLES AND COMPARISONS

Results obtained from the analytic study are
displayed and compared by numerical examples in
this section. The influences of the input workload
characteristics are demonstrated. Analyses
given in the previous section are independent of the
PE-demand distribution F. We assume a "truncated"
normal distribution for the PE resource pool as
demanded by the user's jobs. Such a truncated normal
distribution can be formally defined by

$$P\{X = k\} = \begin{cases} \dfrac{1}{N_0 \, \sigma_0 \sqrt{2\pi}} \exp\left[ -\dfrac{(k - n_0)^2}{2\sigma_0^2} \right], & 1 \le k \le r \\[2ex] 0, & \text{otherwise} \end{cases} \qquad (22)$$

where $N_0 = \sum_{k=1}^{r} \frac{1}{\sigma_0 \sqrt{2\pi}} \exp\left[ -\frac{(k - n_0)^2}{2\sigma_0^2} \right]$ is the
normalizing factor.
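A minimal Python sketch of Eq. 22 follows. The function names are hypothetical, and the example uses $(n_0, \sigma_0) = (64, 16)$ with r = 512, one of the parameter pairs that appears in the numerical examples of this section.

```python
from math import exp, pi, sqrt

def truncated_normal_pmf(n0, sigma0, r):
    """PMF of Eq. 22 on k = 1..r; entry k-1 of the result holds P{X = k}."""
    weights = [
        exp(-((k - n0) ** 2) / (2.0 * sigma0 ** 2)) / (sigma0 * sqrt(2.0 * pi))
        for k in range(1, r + 1)
    ]
    norm = sum(weights)          # the normalizing factor N_0
    return [w / norm for w in weights]

def mean_and_std(pmf):
    """Mean n and standard deviation sigma of the truncated distribution."""
    mean = sum(k * p for k, p in enumerate(pmf, start=1))
    var = sum((k - mean) ** 2 * p for k, p in enumerate(pmf, start=1))
    return mean, sqrt(var)

pmf = truncated_normal_pmf(n0=64, sigma0=16, r=512)
n, sigma = mean_and_std(pmf)
```

For this parameter pair the truncation points lie about four standard deviations or more from the mean, so n and $\sigma$ stay close to $n_0$ and $\sigma_0$.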


The conventional normal distribution with mean
$n_0$ and standard deviation $\sigma_0$ is denoted by
$N(n_0,\sigma_0)$. The truncated normal distribution of
$N(n_0,\sigma_0)$ in the range $[1,r]$ is denoted by
$N_r(n,\sigma)$, with mean $n$ and standard deviation $\sigma$.
Since the truncation is not symmetric with respect to
the mean $n_0$, $n$ may be either greater than $n_0$, when
$n_0 < (r-1)/2$, or less than $n_0$ otherwise. The standard
deviation $\sigma$ is always less than $\sigma_0$ after the
truncation. The difference between $n$ and $n_0$ is
proportional to the magnitude of $\sigma_0$. For example, if
$\sigma_0 = n_0/4$ for $n_0$ ranging from 32 to 256 with
r = 512, $n$ becomes very close to $n_0$. By Fourier
transformation, the convolution operator in Eq.10
can be replaced by multiplication in the frequency
domain. With a given PE-demand distribution F,
the n-fold convolution $F_n$ can be calculated by
using a Discrete Fourier Transform (DFT) pair.

Fig. 3 The system performance measures, (a) $\rho_{CU}$,
(b) $\rho_{PE}$, (c) $\rho$, (d) W, versus the traffic
intensity u, for different values of distributions
$N(n,\sigma)$. [Plots not reproduced: curves for MEAN(n) =
32, 64, 128, 256 with S.D.($\sigma$) = 0.25*MEAN(n).]

The effect on the utilization factors by different
values of n is shown in Figs.3(a,b,c,d).
We considered four different values of $n_0$ ranging
from 32 to 256 with $\sigma_0 = n_0/4$. Note that when
$\sigma_0 = n_0/4$ then $n \approx n_0$. Fig.3.a shows that for each
n, $\rho_{CU}$ increases when u increases. With sufficiently
large u, $\rho_{CU}$ will become saturated (flat
curves as shown). The smallest value of u leading
to the saturated flat utilization is called the
saturation point $u_s$. From Eqs.18 and 19, we know
that the saturation points for $\rho_{PE}$ and $\rho$ coincide
with that of $\rho_{CU}$. Fig.3.a shows that the saturated
$\rho_{CU}$ decreases when n increases. This is obvious
because, with small n, most of the CU's are allocated
enough PE's to perform the SIMD jobs. For
the deterministic case ($\sigma = 0$), the number of active
servers is $\alpha = \min(m, r/n)$. The system can
then be modeled as an M/M/$\alpha$/$\ell$ queue with $u_s = \alpha$.
When $\sigma \ne 0$, because of the variation in PE demand,
the saturated $\rho_{CU}$ is less than in the corresponding
deterministic case. When $\sigma$ is small, the saturation
point $u_s$ is close to the expected number of
active servers $\alpha$. For example, with
$(n,\sigma) = (128,0)$, the saturated $\rho_{CU} = 0.50$ for
($\alpha = u_s = 4$); and with $(n,\sigma) = (128,32)$, the
saturated $\rho_{CU} = 0.44$ for ($\alpha = 3.56$, $u_s = 3.9$). Poor
CU utilization may be caused by high PE demand or
by low traffic intensity.


Fig.3.b shows that $\rho_{PE}$ increases when n increases
for low intensity u. For the saturated case,
$\rho_{PE}$ increases to a maximum value and then drops
down when n increases. The maximal value of $\rho_{PE}$
occurs near the point $n = r/m$. When $n \ll r/m$, there
are still many idling PE's, even though all CU's are
busy. If $n \gg r/m$, the number of idling PE's can
hardly satisfy the large PE-demanding SIMD jobs.
Poor PE utilization is primarily caused by low
traffic intensity or by n being far away from the
ratio $r/m$.

The system utilization $\rho$ is the weighted sum
of $\rho_{CU}$ and $\rho_{PE}$ as shown in Eq.7. Fig.3.c shows
the system utilization $\rho$ in terms of different
n's. The system utilization is high only when
both $\rho_{CU}$ and $\rho_{PE}$ are high. The highest system
utilization occurs at n = 64 in the numeric example
shown. When n approaches the value $r/m$ from either
side of the curve, the utilization $\rho$ improves
significantly as shown. Fig.3.d shows that the average
job response time W increases with u. W increases
also with n because the larger the PE-demand, the
less the chance that the job will be allocated
a sufficient number of PE's. The above results have
been reinforced with extensive simulation results
against the FCFS and the LDSF queueing disciplines
as reported in [9].

VI. RESOURCE OPTIMIZATION METHODS 

In this section, we present analytic methods to
optimize the organization of resources in an MSIMD
system with respect to given workload distributions
and desired performance levels. Cost-effective
machine organization is essential to
parallel processing. The first problem deals with
the determination of the optimal number of PE's
required in the pool given a fixed number of CU's
and a known workload. The second problem decides
the sufficient size of the finite-storage job
queue (or the sufficient capacity of the buffer
area).


A. 


The optimal design of an MSIMD system should
achieve high system utilization at low system
cost. Methods to evaluate the total system
utilization $\rho$ have been presented in previous sections.
The total system cost of an MSIMD system can be
estimated as a quantity which is linearly
proportional to the sum $m + \omega \cdot r$, where $\omega$ is the cost ratio
of one PE (including the attached local memory and
the crossbar switching overhead) to one CU
(including all the attached memory and I/O
facilities). Now we are ready to define a Resource
Utilization and Cost Ratio (RUCR) using $\rho$ defined
in Eq.7.

$$RUCR = \frac{\rho}{\omega \cdot r + m} = \frac{r \cdot \rho_{PE} + n \cdot m \cdot \rho_{CU}}{(r + n \cdot m)(\omega \cdot r + m)} \qquad (23)$$

According to Eqs.17, 18 and 19, both $\rho_{CU}$
and $\rho_{PE}$ are functions of six variables, $(r,m,u,\ell,n,\sigma)$.
Therefore, we can write RUCR = RUCR$(r,m,u,\ell,n,\sigma,\omega)$
as a function of seven variables.
Given fixed values of $m, \ell, n, u, \sigma$, and $\omega$, one can
plot the RUCR(r) as shown in Fig.4. The optimal
size of the resource pool, say $r_0$ PE's, is
determined by finding the maximum value of RUCR(r),

$$RUCR(r_0) = \max\{RUCR(r) \mid r \ge 1\} \qquad (24)$$
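The search of Eq. 24 can be sketched in Python as a scan over candidate pool sizes. This is only an illustration under stated assumptions: `rho` is any callable returning the system utilization for a pool of r PE's, `r_max` is a hypothetical search limit, and `toy_rho` is an invented stand-in for the utilization of Eq. 19, not the paper's model.

```python
def rucr(rho, r, m, omega):
    """Resource Utilization and Cost Ratio of Eq. 23 for a pool of r PE's."""
    return rho(r) / (omega * r + m)

def optimal_pool_size(rho, m, omega, r_max):
    """Pool size in 1..r_max that maximizes RUCR(r), as in Eq. 24."""
    return max(range(1, r_max + 1), key=lambda r: rucr(rho, r, m, omega))

def toy_rho(r):
    """Invented utilization curve: grows linearly, then saturates at r = 100."""
    return min(1.0, r / 100.0)

r0 = optimal_pool_size(toy_rho, m=8, omega=1.0 / 16.0, r_max=512)
```

With the toy curve the ratio rises until the utilization saturates and falls afterwards, because extra PE's then add cost without adding utilization.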

What is demonstrated in Fig.4 corresponds to
an MSIMD system with known parameters $(m,\ell,n,\sigma,\omega)$
= (8,20,64,16,1/16). A family of five curves was
obtained for RUCR(r) with respect to five
representative values of the traffic intensity u
ranging from low (u = 1.5) to saturation (u
> 8.5). Normalized curves with respect to the
maximal value of each RUCR(r) plot are shown. The
optimal choice of $r_0$ corresponds to the peak of
each curve. These peak values of $r_0$ in Fig.4
increase with increasing values of the traffic
intensity u. However, once reaching saturation, the
optimal size $r_0$ becomes upper bounded. Obviously,
the optimal choice of $r_0$ is affected by the cost
ratio used in Eq.23. With fixed traffic intensity
u, $r_0$ decreases rapidly for small
values of the cost ratio $\omega$. The decrease of
$r_0$ becomes much slower for cost ratios greater
than 1/16.

Fig. 4 The Resource-Utilization-and-Cost-Ratio
(RUCR) versus the PE-pool size (r) for five
different traffic intensities (u) with
fixed ratio $\omega$ = 1/16.

Fig. 5 The Relative-Utilization-Improvement-Factor
(RUIF) versus the queue length ($\ell$) for
three different traffic intensities (u).


B. Determination of Sufficient Queue Length 

With a finite-length queue, the newly arriving
jobs will be turned away when the queue is filled
up to full capacity. In general, the longer the
queue is, the better the resource utilization
that should be expected. However, for a sufficiently
large queue length, such improvement may become
negligible. In practice, most computers have finite
buffer areas for the input queues. The effective
queue length can be determined by using the
following Relative Utilization Improvement Factor
(RUIF).

$$RUIF(\ell) = \frac{\rho(\ell + 1) - \rho(\ell)}{\rho(\ell)} \qquad (25)$$

, where the numerator represents the magnitude of
utilization improvement in an MSIMD system with
two consecutive queue lengths.

Given a very small number $\varepsilon$, the sufficient
queue length can be determined by finding the
smallest queue length, $\ell_\varepsilon$, such that RUIF($\ell_\varepsilon$) < $\varepsilon$.
In other words, RUIF($\ell_\varepsilon - k$) $\ge \varepsilon$ for all $k \ge 1$. The
family of RUIF($\ell$) curves plotted in Fig.5 shows
that the relative improvement RUIF($\ell$) decreases
rapidly with increasing queue length. If one draws
a horizontal line in Fig.5, say an $\varepsilon$-line, the
sufficient queue lengths, $\ell_\varepsilon$, for various traffic
intensities are immediately revealed at the
intersection points. When the intensity u
approaches saturation, the sufficient queue length
turns out to be upper bounded by a constant.
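The search for the sufficient queue length can be sketched in Python as follows. The relative form of the improvement factor (difference divided by the utilization at the shorter queue) is an assumption of this sketch, as are the function names, the starting length of 1, the search limit `ell_max`, and the invented curve `toy_rho`, which merely saturates geometrically and is not the paper's Eq. 19.

```python
def ruif(rho, ell):
    """Relative utilization improvement from queue length ell to ell + 1."""
    return (rho(ell + 1) - rho(ell)) / rho(ell)

def sufficient_queue_length(rho, eps, ell_max=10_000):
    """Smallest queue length whose relative improvement falls below eps."""
    for ell in range(1, ell_max + 1):
        if ruif(rho, ell) < eps:
            return ell
    raise ValueError("no sufficient queue length found up to ell_max")

def toy_rho(ell):
    """Invented utilization curve that approaches 1 as the queue grows."""
    return 1.0 - 0.5 ** (ell + 1)

ell_eps = sufficient_queue_length(toy_rho, eps=0.05)
```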
Fig. 6 The sufficient queue length ($\ell_\varepsilon$) versus the
traffic intensity (u) for five RUIF bounds ($\varepsilon$).
(a) $(n,\sigma)$ = (64,16) and $u_s$ = 7.75
(b) $(n,\sigma)$ = (128,32) and $u_s$ = 4.25


This upper bound can be further illustrated by
the curves of sufficient-queue-length ($\ell_\varepsilon$) versus
the traffic-intensity (u) shown in Fig.6. Five
curves are shown corresponding to five representative
$\varepsilon$ values. The peak of each $\varepsilon$-curve marks the
maximum sufficient queue length, $\ell_\varepsilon^{*}$, required, that
is $\ell_\varepsilon(u) \le \ell_\varepsilon^{*}$ for all u. The larger the value of
$\varepsilon$, the shorter the maximal queue length that is
required. Theoretically speaking, the peak
approaches infinity when $\varepsilon$ goes to zero. With a
reasonably small $\varepsilon$, $\ell_\varepsilon^{*}$ is still finite as shown.
The system parameters given in Part (a) and Part
(b) of Fig.6 are essentially the same except for
the different n of PE-demands. It is interesting to
point out that the maximum sufficient queue
lengths $\ell_\varepsilon^{*}$ for various $\varepsilon$ values occur at the
same traffic intensity, when u approaches the
saturation point $u_s$. With two different pairs of
$(n,\sigma)$ = (64,16) and (128,32) in Parts (a) and
(b), respectively, the maximal sufficient queue
lengths are determined at the saturation points
$u_s$ = 7.75 and $u_s$ = 4.25 in Fig.6, respectively.
Note that the sufficient queue lengths decrease
rapidly on both sides of the critical traffic
intensities which produce all the peak values. With
small u, there is no need to use a long queue,
because the queue can hardly be filled up. When u is
very large, even with a short queue, the queue is
hardly ever emptied. Thus, the server can always get a
job from the queue. Theoretically, $\ell_\varepsilon$ can be zero
when u goes to infinity. The above procedures can be
used to determine both the maximal sufficient
queue length (the peaks) and the sufficient queue
length for any given traffic intensity. However,
only the maximal sufficient queue length will be
used in practical design problems.
VII. CONCLUDING REMARKS

We have demonstrated how to use the proposed
queueing model for evaluating the performance of
shared-resource MSIMD computer organizations. The
utilization factors for CU's, PE's and the system
and the average job response time are used to
estimate the system throughput with respect to a
given workload distribution. Our analytic results
will aid the designers of MSIMD systems to optimize
the size of the PE resource pool and to determine
the sufficient queue length. A direct simulation
study of MSIMD machines for parallel vector
processing has been given in Ref. [9].

The CU and PE utilization factors can be further
upgraded if multiprogramming is built into each
CU and each PE. Queueing disciplines other than
FCFS and LDSF can also be considered, such as the
Shortest-Job-Serve-First (SJSF) discipline. The
system degradation due to access conflicts and I/O
overhead was not considered in our study. With
sufficiently large local memories in CU's and in
PE's, the influence due to page faults or memory
access conflicts should have some effect on our
results but not significantly. An alternative
approach using two-dimensional Markov chains for the
analysis of MSIMD machines is currently under
further research by the authors for parallel vector
processing in multiprocessor systems.
Abstract -- Recent multi-microprocessor projects
have included general purpose as well as
special purpose systems. Both types achieve high
performance by using parallelism. But the design
considerations concerning the structure of general
and special purpose computers are even more closely
related. Even if we take analog devices, as the oldest
special purpose machines, into account, the essential
structure of computations performed by both machine
types is the same. This results from the most natural
consideration of computations as motions in real
space and time. Two computer architecture projects
at the University of Erlangen-Nürnberg are described
and discussed from this point of view: a general
purpose parallel/associative hierarchical array
of processors, EGPA, and a system of plug-in
computer modules for special application
space-sharing systems, DIRMU, both currently under
construction.


1. Introduction 


There are a number of recent projects in com- 
puter architecture whose common origin is the pro- 
gress in the technology of large-scale integrated 
semiconductor devices of the 1970's. The designers 
try to connect tens, hundreds or even thousands of 
processors and other computer modules into single 
high-performance systems. Some of these systems 
are intended to be general purpose computers, 
others for special applications. 


We are working on two such projects at the
Institut für Mathematische Maschinen und
Datenverarbeitung of the University of Erlangen-Nürnberg:
a general purpose parallel/associative hierarchical
array of processors EGPA (Erlanger General
Purpose Array) [1], [2], and a system of plug-in
computer modules for special application systems,
DIRMU (Desinvolvimento e Implementação de Redes de
Multiprocessadores) [3], [4]. The latter is a
joint project with the Universidade Estadual de
Campinas, S.P., Brazil.


For both general purpose and special purpose 
computers, an extensive use of parallelism is the 
usual way to achieve high performance at a given 
stage of circuit technology. But special purpose 
machines have evolved from a quite different star- 
ting point than general purpose computers. The 
oldest special purpose machines are analog devices.
Are there still other common structural features
between analog and general purpose digital
computers in addition to the (occasional) use of
parallelism? It is interesting to read a fifteen-
or twenty-year-old book on analog computers and
to see that these were not only highly parallel
but also, quite naturally, "data driven".


If we consider computations as motions in
real space and time - as explained in section 2 -
we see that the traditional differentiation between
the infinite data sets and continuous functions
executed by analog function units on one
hand, and the finite data sets and discrete functions
executed by digital function units on the
other hand is not as significant as the fact that
the essential structure of computations performed
by both machine types is the same. The conceptual
barrier between analog and general-purpose digital
computers loses importance with the steady progress
of large-scale integration and the corresponding
shift from time-sharing to space-sharing
systems, where processing space is shared between
tasks executed at the same time. This becomes evident
if we compare the automatic patching of analog
function units in hybrid computers (cf. Mawson [5])
and the dynamically reconfigurable or varistructured
digital computers suggested recently (cf. Miller
and Cocke [6], Arnold and Page [7], Lipovski [8],
Lipovski and Tripathi [9], Kartashev and Kartashev
[10]). Multi-microprocessor DDA-systems (digital
differential analyzers) dedicated to space-sharing
solution of differential equations (cf. Korn [11],
Kempken and Ameling [12]) are another example.


In sections 3 and 4 we briefly describe our
computer architecture projects EGPA and DIRMU and
discuss the design decisions related to the structure
of computations. The present EGPA-system consists
of five microprogrammable 32-bit processors
AEG 80-60 configured in a way that allows multiple
modes of parallel and associative processing. The
pilot implementation of DIRMU modules uses the
16-bit microprocessors Intel 8086 and multiport
memories. The first application is a space-sharing
configuration for a learning classification process
used in pattern recognition systems (cf. Rohrer
[13]).


2. The structure of computations 


To compute means to obtain for some arguments
x the appropriate results y:

$$x \mapsto y,$$

or, otherwise stated, to execute some function
(operation, map) $f : X \to Y$, $x \mapsto y$. If we happen
to find some natural process executing this
function in real space and time, we are finished.
This is a very simple case of analog computation.


Examples of such processes are motions of
wheels and levers in mechanical calculators and
analog devices, or motions of electrical signals
in operation units of an analog computer. We note
that the execution of a boolean operation such as
AND : {0,1} × {0,1} → {0,1} in switching circuits
of a digital computer also relies on the physical
analogue of motions of electrical signals through
the corresponding gate. This is only a special
case of analog computation where the argument and
the value set are very small.


If we do not find any natural analogue for
the execution of our desired function f we have to
search for other functions executed in an analogue
manner which, appropriately composed, give us the
function f. Fig. 1 shows an example of the general
case of compound functions.


Figure 1: Example of compound functions 


Directed graphs such as Fig. 1 are mathematical
tools which help us to see what happens when
we compute. Compound functions are well known in
mathematics. One can prove that all graphs of the
type shown in Fig. 1 can be composed of
"primitive" graphs by using only two sorts of
composition: parallel (essentially putting several
graphs "horizontally" next to each other, such as
"g" and "h", or as the multiple copies of "f" in
Fig. 1), and in series (essentially putting two
graphs one above another and connecting the nodes).
The related mathematical structures are called
"algebraic theories", cf. Lawvere [14], or Arbib and
Give'on [15] (especially Theorem 2.4, p. 340), [16];
for the relevance of algebraic theories for the whole
of computer science cf. for example Goguen, Thatcher,
Wagner and Wright [17]. In [18], [19], a calculus
of expressions for the classification of computer
architectures (ECS: Erlanger Classification Scheme)
has been proposed that reveals the same structure
(cf. also a companion paper [20]).
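The two sorts of composition can be illustrated by a small Python sketch. It assumes that every operation takes its arguments positionally and returns a tuple of results; the helper names `serial` and `parallel`, the explicit arity argument and the AND/OR example are choices made for this illustration only.

```python
def serial(f, g):
    """Composition in series: f followed by g, the outputs of f feed g."""
    def composed(*args):
        return g(*f(*args))
    return composed

def parallel(f, g, arity_f):
    """Parallel composition: f and g side by side on disjoint arguments.
    arity_f says how many of the arguments belong to f."""
    def side_by_side(*args):
        return f(*args[:arity_f]) + g(*args[arity_f:])
    return side_by_side

# Primitive operations return tuples so that they can be wired together.
def and_gate(a, b):
    return (a & b,)

def or_gate(a, b):
    return (a | b,)

# Compound function (a AND b) OR (c AND d): two AND nodes in parallel,
# followed in series by one OR node.
and_or = serial(parallel(and_gate, and_gate, 2), or_gate)
result_true = and_or(1, 1, 0, 1)
result_false = and_or(0, 1, 1, 0)
```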


Now, speaking about the structure of computa- 


tions we mean first of all precisely the directed 


graphs such as Fig. 1. If we agree that "to com- 
pute" means to execute some functions, then any 
computation whatever, analog or digital, can be 
represented by directed graphs such as Fig. 1. 
This structure occurs at several levels in a given 
application. At the switching circuit level, the 
operations marking the nodes of Fig. 1 would be 
AND, OR, NOT, NAND, etc.; at the instruction level 
and the subroutine level each operation actually 
presents a composition and union of operations of 
the level one below, and so on. We call this the 


successive interpretation (Cte 12 


The extent to which the structure of compound
functions is spatially reflected in real
machines differs from level to level, and also
from one specific machine to another since the
"computation space" is related to the hardware
cost. Fully spatial expansion appears often at the
switching circuit level, where we get combinational
circuits as a direct isomorphic image of the
corresponding graph Fig. 1. At the instruction
level, the classical Princeton-type computer expands
in space only one "programmable" node of
Fig. 1 in its ALU, where the contents of the
instruction register define the performed operation
Big. ys Wigs. Geceus The execution of compound opera- 
tions has to be composited in the time coordinate. 
In contrast to this, Illiac IV, Cyber 76, and 
others expand at least certain special cases of 
compound operations spatially, such as cartesian 
power ("SIMD-machines") or cartesian product 
("MIMD-machines"). Some more recent attempts in 
this direction are seen in the "data-flow" or 
"data-driven" and "single assignment" computers 
(cf. Dennis and Misunas [22], Rumbaugh [23], Ar- 
wind and Gostellov [24], Plas et al. [25]). Opera- 
tion units of an analog computer are connected in 
a way that fully reflects the structure of com- 
pound operations in space. 


We can generally speak of the space/time tra- 
de-off in all machines. Parallel, (instruction-) 
pipeline, data-flow, single-assignment and similar 
computers use more computation space (i.e. opera- 
tion units) at the instruction level than conven- 
tional computers, in order to reduce computation 
time. Arwind and Gostellov explicitly speak of 
exchanging processors for time [24]. This is also 
a quite usual trade-off in analog computers (cf. 
Adler and Neidhold [26], Chapter 2.2.2). At the 


bit-operation level of digital computers, a well- 
known example from the beginner's logic design 
course is the parallel adder that requires more 
computation space but less computation time than 
the sequential adder. More advanced examples are 
arithmetic pipelines in the Cray 1 or Star 100 and 
its successors Cyber 203, 205, where the structure 
of compound operations such as floating addition, 
floating multiplication is fully expanded in space. 
Similar space/time trade-offs apply not only to
hard-machines, but to soft-machines (programs),
too. Any experienced programmer knows that, in
general, to make a program faster costs additional
program lines, and vice versa.
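The adder example can be made concrete with a short Python sketch. The cost model is a deliberate simplification assumed for this illustration: the sequential adder reuses one full adder over n clock steps, while the parallel adder is counted as n full adders working within a single clock step, ignoring carry propagation delay. All function names are hypothetical.

```python
def full_adder(a, b, carry_in):
    """One-bit full adder: returns (sum bit, carry out)."""
    total = a ^ b ^ carry_in
    carry_out = (a & b) | (carry_in & (a ^ b))
    return total, carry_out

def add_bits(x_bits, y_bits):
    """Add two equal-length bit lists (least significant bit first)."""
    carry, out = 0, []
    for a, b in zip(x_bits, y_bits):
        s, carry = full_adder(a, b, carry)
        out.append(s)
    return out + [carry]

def sequential_adder_cost(n):
    """One full adder reused in time: little space, n clock steps."""
    return {"full_adders": 1, "clock_steps": n}

def parallel_adder_cost(n):
    """n full adders expanded in space: more space, one clock step."""
    return {"full_adders": n, "clock_steps": 1}

# 5 + 3 on 4 bits, least significant bit first.
sum_bits = add_bits([1, 0, 1, 0], [1, 1, 0, 0])
```

Both adders compute the same function; only the way the compound operation is laid out in space and time differs.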


We note that the above considerations of com- 
putations and their structure are not at all new. 
In fact, they are so old and simple that some re- 
searchers seem to ignore them, in speaking of ac- 
tors, tokens and firing. 


3. General purpose space-sharing systems: EGPA 


Our considerations regarding the structure of 
computations hold also at the program level of the 
software and the corresponding PMS level of the 
hardware (cf. Bell and Newell [27]), as well as 
the related space/time trade-off. The PMS level of 
computer design became especially attractive in 
the 1970's, as the number of multi-microprocessor 
projects shows. We discuss the main design consi- 
derations of our projects at this level in the 
following. 


Fig. 1 here shows the operations (tasks) f,g, 
h, ... defined by different user program modules 
and their composition while they are executed. The 
usual single-processor general purpose computer 
essentially reflects only one "programmable" node 
of Fig. 1 in its processor, if we neglect I/O. The 
execution of compound operations such as Fig. 1 
has to be composed in the time coordinate; one
speaks of time-sharing. By space-sharing we mean 
the case where the execution of compound operations 
is spatially expanded in the machine, as mentioned 
in the last section. In [28], the notion "macro- 
pipelining" was used in a similar context, with a 
general (non-linear) pipeline ("data-flow") at the 
PMS-level. Parallelism and linear pipelining at the
PMS-level are only the expansion of special cases
of compound operations in processing space: the
cartesian power or product, and the function
composition f o g (f followed by g).
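
The three special cases named above can be written down as ordinary higher-order functions. The following is a minimal sketch, not a description of any machine; the names cartesian_power, cartesian_product and compose are chosen here for illustration only. In a space-sharing machine each application of f or g inside these combinators would be assigned its own operation unit.

```python
def cartesian_power(f, n):
    """n copies of the same operation f applied to n data elements."""
    def op(xs):
        if len(xs) != n:
            raise ValueError("expected %d data elements" % n)
        return [f(x) for x in xs]
    return op


def cartesian_product(*fs):
    """Different operations applied side by side to different data."""
    def op(xs):
        if len(xs) != len(fs):
            raise ValueError("expected %d data elements" % len(fs))
        return [f(x) for f, x in zip(fs, xs)]
    return op


def compose(*fs):
    """Linear pipeline: compose(f, g) is f followed by g, as in the text."""
    def op(x):
        for f in fs:
            x = f(x)
        return x
    return op
```

For instance, compose(cartesian_power(f, 4), collect) would describe four parallel copies of f followed by one collecting operation, collect being any function that takes the list of four results.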


The main problem in the use of space-sharing 
in general purpose systems is that the structure 
of compound operations in user programs changes 
from application to application. One possible so- 
lution is to provide for flexible interconnections 
between processors that can change dynamically. 
This is the approach taken in various projects of 
dynamically reconfigurable systems. The well-known 
drawback is the price of the interconnection struc- 
ture and its complexity which grows with the 
square of the number of processors. The complexity 
again causes additional problems in the computer 
organization. 


An alternative solution would reflect some 
approximation to the "universal" structure of 
compound operations in fixed interconnections, and 


then provide for dynamic projections of the actual 
application structures into the fixed structure. 
Fig. 1 suggests that local interconnections between
processors should be sufficient in most
cases. The structure of the pool of locally
interconnected processors should take into account that
cartesian power composition of operations - i.e.
several occurrences of the same operation to be
executed on different data elements ordered in
arrays - appears in many applications. The dynamic
projection of the applications into the processing
space requires control overhead. To be consistent
in space-sharing, the processing capacity for this
task should be provided by extra processors.


Fig. 2 shows the EGPA configuration suggested
in [2] that satisfies these requirements. A
regular array of processors, each having its
local memory block, is interconnected through
multiport memory access from direct neighbours. The
memory blocks store application programs for the
processors and also serve as buffers for data flow,
which is implemented simply by exchanging
addresses between neighbours. The pyramid over this
array is the space-sharing implementation of the
control hierarchy. Local interconnections through
multiport memories are well suited here, too, so
that the whole system is arbitrarily expandable.
There is no need for special interconnection
modules; all main components are homogeneous and
standard.


Figure 2: EGPA configuration (legend: processor, local memory)


This configuration allows the following three pro- 
cessing modes in the array controlled by the pyra- 
mid: 


- normal multiprocessing of independent tasks 

- cartesian power processing of (multiple copies 
of) a task ("PMS-parallel") 

- general space-sharing processing projected into
the array ("general PMS-pipeline") 


The use of microprogrammable processors and specific
microprogrammed instructions for "vertical
processing" - a method of implementation of associative
processing on conventional hardware (cf.
[29], [30], [31]) - allows a further three modes
of parallel associative processing which are
combinations of the above modes with vertical processing
(cf. [2]).


The present EGPA system built in cooperation 
with AEG-Telefunken Konstanz consists of the 
"smallest pyramid", i.e. of a four processor array 
and a fifth processor as the control. The aim of 
this pilot implementation is to prove the viability 
of the concept and to gain experience with space- 
sharing systems across the related disciplines of 
computer science, as the following list of some 
design and implementation subactivities of the 
pilot project shows: 


hardware: 
1) interprocessor interrupts 
2) support I/O-module for vertical processing 
3) hardware measurement monitor 
firmware: 
4) instructions for vertical-associative processing 
operating systems: 
5) multiprocessing time-sharing operating system 
for the pilot system. 
performance analysis: 
6) measurements and analysis of system performance 
(cf. companion paper [32]). 
programming languages: 
7) Fortran-extension for vertical-associative pro- 
cessing 
application software: 
8) analysis of application algorithms with regard 
to their suitability for EGPA processing modes 
9) implementation of selected applications 


The pilot system uses the microprogrammable 
32-bit processors AEG 80-60. The local memories 
of the processors of the array consist of 64K bytes
each, and the local memory of the control processor
has 256K bytes capacity. The system is due
to be operable in March 1980. 


4. Special purpose space-sharing systems: DIRMU


Considerations regarding the structure of computations
and the related space/time trade-off (cf.
section 2) at the PMS-level of computer design are
yet more fruitful in the case of special applications.
A machine is dedicated to a particular computation
structure, so that high-performance special-purpose
computers have been able since the
1960's to expand this structure spatially to the
extent that it was economically feasible (cf.
Enslow [33]). The economic considerations have
changed rapidly with the advent of LSI. Let us
assume a user with a particular application, where
a directed graph such as Fig. 1 shows the composition
of operations (tasks) defined by user program
modules. As an alternative to the use of an (expensive)
general purpose computer, where the program
modules would be executed in the time-sharing
mode, the user can now purchase cheap microprocessor,
memory, and support chips, join them together
in a way described by the graph of compound
operations, and execute the tasks in the
space-sharing mode.


However, the design of a large special-purpose
system using LSI components is a fairly complicated
and time-consuming task that has been
performed only by teams of specialists from computer
manufacturers or universities. Even conventional
microprocessor system development for simple
control applications today requires knowledge and
experience in electronics, computer organization,
and machine language programming from the designer,
in addition to knowledge of the desired application.
If we want to make the above alternative
directly available to the user, then it should not
be more difficult to configure his space-sharing
system than to link program segments together in a
time-sharing environment. Another comparable task
is the configuring of analog operating units at
the patch-board of an analog computer. We recall
that all computations, digital as well as analog,
possess the same essential structure.


Thus the aim of the DIRMU project is to design
a system kit of plug-in LSI modules for configuring
user-definable high performance special
purpose systems [3]. The principal shape of the
DIRMU module should be no surprise after the above
explanation. It corresponds to a single node of a
graph such as Fig. 1. Fig. 3 shows a simplified
notation for the module consisting of a processing
unit, local memory for program storage and data
buffering, and connections to other nodes. A complete
space-sharing DIRMU system for pattern recognition,
our first application described further
below, is shown in Fig. 4.


Figure 3: DIRMU module (processing unit, local memory)


At this place we relate the DIRMU project to
other similar projects. The micromodules of Cooper
[34] should decrease the fabrication and debugging
costs of logic circuits in the design of parallel
and pipeline computers. Computer modules of Fuller,
Siewiorek and Swan [35], and of Kartashev and
Kartashev [10], as well as commercially available
bit-slice microprocessors must appeal primarily to
computer architects. The main objective of the
DIRMU modules is their easy configurability, in
order to aid the end user directly.


Figure 4: Example of automatic classification
(pattern recognition), with preprocessing, comparison
and decision, and a feed-back connection ("learning")


Architecture of DIRMU modules. 


Considerations of desired system properties 
have led to the following requirements for the 
modules [3]: 


- microprogrammable processor, 16-bit wordlength,
interrupt capability

- microprogram memory consisting of a read-only
part for the implementation of the basic
instruction set, and a read/write part for the
implementation of special-purpose user definable
instructions

- local working memory

- provisions for easy reconfigurability of the
system, e.g. through the use of multiport memory
access for local working memory of each module

- provisions for connection of two modules to form
a single module of wordlength 32 bit, if an
enhanced processing power of some modules of the
system is required (similarly to bit-slice
microprocessors)

- each module having I/O-ports for communication
with the outside world.


The microprogrammability will make modules more 
flexible for adaptation to special processing 
modes (e.g. non-numeric processing, associative 
processing, cf. [29], [30], [31]). 


Pilot implementation of the modules. 


In order to gain initial experience and results
as soon as possible, it was decided to build
pilot modules using standard microprocessor components.
The 16-bit microprocessor Intel 8086 seemed
to be the best compromise currently achievable.
It receives good hardware and software support
from the manufacturers and offers multiprocessor
capabilities. Microprogrammability and some other
features from the original catalogue of requirements
for the module had to be postponed.


One of the most important properties of our 
pilot modules had to be an easy configurability of 
their internal structure, because of their experi- 
mental character. We have therefore chosen the 
system design kit Intel SDK-86 as the basic buil- 
ding element for our modules. It offers the micro- 
processor 8086 together with all support circuits 
needed for the "minimum-mode" system configuration 
(Intel) including 2K bytes RAM, parallel and se- 
rial I/O-interfaces, hexadecimal keypad and dis- 
play, 8K bytes ROM with system monitor for keypad/ 
display or serial-I/O machine code programming 
(teletype or CRT) and for serial program loading. 
A large area of the board is left for user's 
custom extensions. 


The structure of our module is shown in Fig. 
5. It is a dual bus system with a processor bus 
controlled by the CPU and a memory bus controlled 
by a specifically designed multiport controller. 
Figure 5: pilot DIRMU module


Processor part. 

The processor part consists of the 8086 with 
the clock and a modified wait-state generator of 
the SDK-86. The control bus has been extended 
through the use of the bus controller Intel 8288 
to the "maximum-mode" bus (Intel) that is neces- 
sary for multiprocessor operation. The module's 
private memory consists of 16K bytes EPROM and 4K
bytes RAM for system initialisation and monitor
routines and tables. The original keypad and dis- 
play of the SDK-86 serve as a useful maintenance 


console of the pilot module for the inspection of 
register and memory contents, and for single-step 
execution during system testing and debugging. A 
CRT display terminal can be connected through the 
serial I/O-port to any module for the same purpose. 
Only the latter feature will be provided in the 
final DIRMU modules. The three parallel I/O-ports 
of the SDK-86 remain unchanged. They allow
interfacing a module to a mass storage device or
a host computer.


The processor requests access to its local 
working memory as well as to the working memories 
of its neighbours via bus buffers. The additional 
control signals needed to control the memory access 
that are not available in the 8086, are generated 
within the port-access control logic. These are 
the handshake signals REQ-i, RDY-i and ETRANS 
necessary to coordinate and synchronize memory 
access via the port i. The LOCK signal of the 8086 
to the multiport controller is needed when conse- 
cutive memory cycles are requested. 32 lines of 
the processor bus are used for separate 16-bit 
data and addresses. The port signals REQ-i are 
generated by decoding the uppermost four address 
bits of the 8086. Only eight access ports are used 
at present. 
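
The decoding of the uppermost four address bits can be illustrated by a short sketch. The split of a 20-bit address into a 4-bit block number and a 16-bit offset follows from the text; the assignment of block numbers 0 to 7 to the ports 0 to 7 is purely an assumption made for this illustration, since the actual assignment is not specified here.

```python
ADDRESS_BITS = 20                         # 1M byte address space of the 8086
BLOCK_BITS = 4                            # uppermost four address bits
OFFSET_BITS = ADDRESS_BITS - BLOCK_BITS   # 16 bits, i.e. 64K bytes per block
NUM_PORTS = 8                             # only eight access ports at present


def decode(address):
    """Split a 20-bit address into (block, offset, port).

    `port` is the index i of the REQ-i line that would be raised, or
    None if the block is not mapped to a working memory. The mapping
    block -> port used here (block k selects port k for k < 8) is a
    hypothetical choice, not taken from the pilot hardware.
    """
    if not 0 <= address < (1 << ADDRESS_BITS):
        raise ValueError("address outside the 20-bit address space")
    block = address >> OFFSET_BITS
    offset = address & ((1 << OFFSET_BITS) - 1)
    port = block if block < NUM_PORTS else None
    return block, offset, port
```

Under this assumed mapping, an address in block 0 would reach the module's own working memory through port 0, and addresses in blocks 1 to 7 would reach the working memories of neighbour modules.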


Memory part. 


A multiport read/write memory with eight me- 
mory access ports is used as a local working memo- 
ry of each module. The port number zero serves for 
the access of the corresponding local processor, 
ports number 1 to 7 are used for the connections 
to the neighbour modules as required by the opera- 
tion structure of the application (cf. section 2). 
The working memory stores the application program 
code defining the operation of the module and also 
serves for data exchange. Word (16 bit) or byte
data transmissions through the selected port and
the corresponding memory bus are possible for data
exchange. The access is controlled on a request/ 
grant basis by a specifically designed multiport 
controller. 
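
The request/grant behaviour of such a controller (a queue register for pending requests, round-robin priority arbitration, and consecutive grants while LOCK is active, as described with Fig. 6) might be modelled as follows. This is a behavioural sketch only; the class and method names are ours, timing is reduced to one grant per call, and nothing here is taken from the actual controller logic.

```python
class MultiportArbiter:
    """Behavioural model of request/grant access to one working memory."""

    def __init__(self, num_ports=8):
        self.num_ports = num_ports
        self.pending = {}                  # port -> LOCK flag (queue register)
        self.last_granted = num_ports - 1  # so that port 0 is examined first
        self.locked_port = None            # port currently holding LOCK

    def request(self, port, lock=False):
        """Raise REQ-port, optionally accompanied by LOCK."""
        if not 0 <= port < self.num_ports:
            raise ValueError("no such port")
        self.pending[port] = lock

    def grant(self):
        """Serve one memory access; return the granted port or None.

        While a port holds LOCK, only that port is served; all other
        requesting processors stay in wait states.
        """
        if self.locked_port is not None:
            if self.locked_port not in self.pending:
                return None
            port = self.locked_port
        else:
            port = None
            for step in range(1, self.num_ports + 1):  # round-robin scan
                candidate = (self.last_granted + step) % self.num_ports
                if candidate in self.pending:
                    port = candidate
                    break
            if port is None:
                return None
        lock = self.pending.pop(port)
        self.locked_port = port if lock else None
        self.last_granted = port
        return port
```

In this model a processor releases the memory by issuing its last access without LOCK; the next call of grant() then resumes the round-robin scan behind that port.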


Fig. 6 shows the block structure of the controller.
Multiple simultaneous requests to memory
are saved in a queue register. The control logic
activates the priority arbitration part that decides
which request will be granted. The grant
signal GNT-i of the controller activates the
corresponding memory port (cf. Fig. 5) and at the
same time a ready signal RDY-i is sent to the
processor that receives memory access. Other requesting
processors are forced into 8086 wait states.
The port access logic of the served processor answers
with an end-of-transfer signal ETRANS as soon
as the memory access of the processor has finished,
and the multiport controller grants memory access
to the next processor selected by the priority
arbitration logic. If the request has been accompanied
by the LOCK signal of the processor - which
can be forced by the corresponding prefix in front
of any 8086 instruction - consecutive memory accesses
are granted to this processor as long as
its LOCK is active. The priority arbitration logic
serves multiple simultaneous requests using a
round-robin strategy. If only one access at a time is
requested, the multiport access control remains
invisible to the accessing processor.


Figure 6: block structure of the multiport controller
(handshake signals REQ-i, LOCK-i, ETRANS-i, RDY-i;
queue register; control logic; priority arbitration;
port control signals)


The 1M byte address space of the 8086 suggests
the partitioning in 16 blocks of 64K bytes each,
where one block is used for the processor's private
EPROM and RAM and up to 15 blocks can be used
for the working memories of the modules accessible
by one processor, including its local memory. In
our pilot implementation only 8 blocks of working
memories are accessible by one processor, leaving
nearly half the address space unused. Any working
memory of maximally 64K bytes capacity can be
accessed through one of its eight memory ports by
different processors. The 20 bit addressing capacity
would allow modified configurations with local
memories up to 128K bytes each or with an increased
number of access ports of each processor.


Programming considerations.


We have called the local read/write memory
block of a DIRMU module as shown in Figures 3 and
4 (and accessed through the port 0 in Fig. 5) the
working memory of the module, since it serves not
only as a module's buffer for data flow
through the system but also for the storage of the
application program code defining the operation of
the module. This program code can be loaded by a
serial loader program residing in the private EPROM
of the processor through the serial I/O-port.
Alternatively, a space-shared bootstrapping could
be applied, where the application programs of all
modules are loaded through the fixed system input
(cf. Fig. 4) and spread over the system. If several
modules perform the same operation (task) (e.g.
modules 1, 2, 3 and 4 in Fig. 4), they can use
reentrant code located in their common direct
successor.

A draw-back of the above solution is that
instruction fetch of a processor has to compete with
the memory access of the neighbour modules to its
local working memory. The port access logic of the
processor part and the multiport controller of the
module cause a delay that will require at least one
8086 wait state even in case of a single access
request at a time. We consider therefore also the
solution in which a larger private memory of the
processor stores the application code. EPROMs
containing the code could be directly plugged into the
processor board.


Applications and Outlook. 


The DIRMU system is intended as a modular system
kit for user specified space-sharing implementa- 
tion of any application. The system configurabili- 
ty at the PMS level suggests first of all such 
applications where a clean operation structure at 
the corresponding software level can be recogni- 
zed easily (cf. section 2). 


Fig. 4 shows a space-sharing configuration
for a learning classification process used in pattern
recognition systems (cf. Rohrer [13]). It
consists of pattern input and preprocessing followed
by four parallel classifiers 1, 2, 3 and 4.
Their intermediate results are passed to a further
classifier pair 5, 6. Measurements of distances
between given templates and the submitted pattern
allow comparison and decision making for a possible
next iterative step, in which case a learning
feedback effect on the classifiers applies.
These phases are performed and controlled by the
next two modules in Fig. 4. The model has already
been programmed and tested on a PDP 15 and found
suitable for a space-sharing implementation [13].


The pilot DIRMU system for this application
is being built at our institute. Other applications
include the feedback solution of ordinary differential
equations (cf. Korn [11]) and the relaxation
method for the solution of partial differential
equations. In parallel to this pilot DIRMU
project, design of an ultimate LSI DIRMU-module
has begun. The following features are considered
in addition to the design objectives mentioned at
the beginning of this section:


- special instructions for data exchange such as
broadcast and collect as well as instructions
for processor intercommunications

- module synchronization/communication phase as a
part of the basic system cycle containing the
usual instruction fetch/execute cycle of each
processor

- failure tolerant system provisions.


PERFORMANCE MODELING AND EVALUATION FOR HIERARCHICALLY
ORGANIZED MULTIPROCESSOR COMPUTER SYSTEMS*


U. Herzog, W. Hoffmann, W. Kleinöder
Institute of Mathematical Machines
and Data Processing (IIT)
University of Erlangen-Nürnberg, Germany


Abstract -- We first overview the field of 
traffic theory, related problems, its methodology 
and its importance: performance modeling and eva- 
luation is needed from the initial conception of a 
system architecture to the daily operation of a 
computer system.


Secondly, we show how to model and evaluate 
the performance of hierarchically organized multi- 
processor computer systems. Most important: the 
flow of information depends on the hardware/soft- 
ware structure and the structure of the application 
programs, as well. We therefore classify applica- 
tion programs, discuss their implementation on 
hierarchically organized multiprocessor systems 
and describe the flow of information by a new class 
of queuing models. 


We finally evaluate some interesting examples, 
the results of which are obtained by exact or close 
approximate solutions. 


1. Introduction 


The field of traffic theory is to describe, 
to analyze and to optimize the flow of data and 
control information in computer systems. Perfor- 
mance characteristics, such as utilization, 
throughput and response time give information 
about the efficiency of the hardware/software 
structure and allow the detection of bottlenecks. 
Computer performance evaluation is necessary from 
the initial conception of a system architecture to 
the daily operation of a computer installation. 


This paper deals with the performance mode- 
ling and evaluation for hierarchically organized 
multiprocessor computer systems: There are several 
layers of computers, a connecting network between 
and within these layers, and a strict organization 
of the overall system. Hierarchical structures are 
transparent since we may concentrate coordination 
problems while distributing independent user tasks. 


Typical examples are the EGPA-project [3], 
the multiprocessor system at SUNY [4], the Siemens
system SMS [12], Mopps [15], X-TREE [2] and others. 


We first introduce traffic theory and dis- 
cuss typical problems we are interested in. 


Secondly, we show how to model and analyze
hierarchically organized multiprocessor computer
systems. As we shall see, the temporal sequence of
events is determined not only by the structure and
operating mode of a multiprocessor system. Rather,
it is heavily influenced by the internal structure
of the application programs to be run on the
system.


We therefore classify the variety of application
programs, consider their implementation on
a given computer system and develop the corresponding
models.


The quality of performance statements heavi- 
ly depends on the accuracy of our models: The more 
careful our modeling technique, the more reliable 
are the results. 


We therefore have to take into consideration 
the particulars of hierarchically organized multi- 
processor systems such as synchronization, data 
transfers, code and data sharing. Introducing a 
new class of queuing systems we show how to study 
the impact of these problems on the efficiency of a
system. 


2. General Remarks on Performance Modeling and Analysis (Traffic Theory)


The field of traffic theory for computer 
systems is determined by the following three 
points:


- the investigation of service and transportation 
processes within a single processor, between 
several processors within the hierarchy, and 
between the hierarchy and I/O-facilities, 

- the definition and evaluation of characteristic
performance values, such as utilization, through- 
put and response time, 

- the detection and removal of bottlenecks related 
to the structure and operating mode of the 
system. 


In doing so the overall objective is to design 
optimal economic structures and operating modes 
for 


- prescribed performance under normal conditions,
and 

- prescribed minimum performance under heavy load 
and failure conditions. 


Because our investigations were initiated 
and influenced by the EGPA-project, a 1.7 Million 
Dollar project at the University of Erlangen, we 
outline next the EGPA-architecture and related 
traffic problems. 


2.1 The EGPA-pyramid [3] 
The EGPA (Erlangen General Purpose Array)
consists of a rectangular array of processors
(A-processors) connected via multiport memories.
Each processor may access its "own" memory, and
the memories of its four neighbours in the north,
west, south and east direction (figure 1). The
edges of the rectangle are connected to form a
toroid. In addition to the array-processors, there
are processors dedicated to the transfer of data
between the array-processors and the peripherals.
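
The toroidal neighbourhood can be expressed in a few lines. The sketch below assumes, for illustration only, that A-processors are addressed by (row, column) with row 0 at the northern edge; the function name is ours.

```python
def accessible_memories(row, col, rows, cols):
    """Memory blocks an A-processor at (row, col) may access.

    Indices wrap around at the edges of the rows x cols rectangle,
    which is what closes the array into a toroid.
    """
    return {
        "own": (row, col),
        "north": ((row - 1) % rows, col),
        "south": ((row + 1) % rows, col),
        "west": (row, (col - 1) % cols),
        "east": (row, (col + 1) % cols),
    }
```

Note that in a 2 x 2 array the northern and southern neighbours coincide, as do the western and eastern ones, so each processor reaches only three distinct memory blocks.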


At present, each of these processors, called 
boundary (B)-processors, is allocated to a 2 x 2 
square of the array. In addition there is a single 
processor, called control (C)-processor, which has 
a supervisory function. The pyramid structure is 
shown in figure 2. A pilot-pyramid is being imple- 
mented now.* 


Fig. 1: Erlangen General Purpose Array (EGPA)
(legend: processor, memory block, toroidal closing)


Fig. 2: Cellular structure of EGPA
(legend: control processor; connection in one direction
only; connections in both directions)


* The EGPA-Project is supported by the BMFT, the
German Ministry of Research and Technology. 


In analyzing the performance we may distinguish
between several levels of traffic problems.
Depending on the point of view, we consider either
the global behaviour of the system or the
detailed description of specific subproblems.


2.2 Macroscopic behaviour and related performance characteristics


As stated above, the overall objective is to 
find optimal economic structures and operating 
modes for a prescribed workload of the computer 
system. We therefore have to investigate the effi- 
ciency of each individual component (processors, 
memories, ...), the interconnection scheme (commu- 
nication between components), and the efficiency 
of the global operating system (distribution of 
user task, assignment of resources, ...), as well. 


Typical performance characteristics which 
give insight into these global problems are 


- system throughput

- response time (mean and distribution function) 

- utilization of processors (idle state, user 
state, monitor state) 

- mean number of processors working simultaneously 

- overlap (concurrency) of resources 

- global synchronization of user tasks 

- utilization of storage units and channels 

- page traffic, etc. 


2.3 Microscopic behaviour and related performance measures


Here we investigate the flow of control in- 
formation and data at a much more detailed level. 


First we may be interested in a detailed 
analysis at the user program level: 


- response time for individual application programs 

- percentage of overhead, data transfers, applica- 
tion processing 

- degree of parallelism. 


Secondly, we are interested in a detailed analysis 
at the task level: 


- number of tasks, interarrival and execution 
times 

- number of task interruptions and related over- 
head 

- degree of parallelism 

- influence of the global and local operating
systems (coordination, synchronization, etc.) 


Thirdly, we are interested in a detailed analysis 
of data transfers between several layers of the 
multiprocessor system and background storage, as 
well: 


- number of blocks, interarrival times and block 
length, 

- queue length and traffic bottlenecks, 

- proper design of buffers, etc. 


Last but not least a very important area is the 
investigation of memory interference and related 
conflicts, especially when several processors may 
access the same storage unit. 


2.4 Performance analysis 


There are three important methods available 
for the investigation of traffic flow [7]: 


1) The measurement of real system behaviour by 
special software and hardware components, 

2) The simulation of the structure and operating 
modes by means of special programs or simula- 
tion languages, 

3) The mathematical methods which allow the exact 
or approximate investigation of the traffic 
flow in components or in the overall system. 


The main point of measurement is that we
obtain real data of existing systems, and we obtain
results to check our initial assumptions of simulation
and mathematical methods. However, measurements
which do represent the actual system behaviour
are rather expensive and time consuming. And
what about new system structures and operating
modes, not yet implemented?


The main advantage of simulation compared to 
the mathematical methods is twofold: A very de- 
tailed modeling and evaluation of the actual traf- 
fic flow is possible. And new problems may be in- 
vestigated in relatively short and well defined 
time intervals. However, the problem of system 
engineers is not only the analysis but also the 
synthesis of computer systems. And real optimiza- 
tion at reasonable expense is only possible by 
means of the mathematical methods. 


This summary shows that each of these 
methods has its advantages and disadvantages, as 
well. We therefore may find all three methods in 
various stages of the design process of a computer 
system. 


Up to now we discussed the scope of traffic 
theory in general and by means of examples. Let 
us complete these remarks in summarizing the main 
steps of the traffic theory: 


1) Critical analysis of existing computer systems 
and existing tools for performance evaluation 

2) Modeling of computer systems and its hardware/ 
software components 

3) Analysis of the performance by means of simulation
and exact or approximate mathematical tools


4) Validation of the above models by system-measu- 
rements 

5) Synthesis of optimal components and an optimal 
overall design. 


As mentioned above, we focus our attention
on the problem of modeling. The other subjects,
however, are briefly touched and related publica- 
tions cited. 


3. Modeling of Hierarchically Organized Multiprocessor Computer Systems


3.1 Methodology 


The flow of information in a multiprocessor
computer system depends on the hardware components,
the interconnection scheme and the operating system.
It is, however, heavily influenced also by the
internal structure of the algorithms implemented
in the application programs.


Our modeling methodology therefore includes 
the following steps: 


1) Classification: we classify the variety of 
application programs using directed graphs. 

2) Implementation: we consider the implementation 
of these application programs on hierarchically 
organized multiprocessor computer systems 
taking into consideration the operating mode 
(monoprogramming and/or multiprogramming).

3) Modeling and evaluation: we develop and analyze 
the corresponding queuing models taking into 
account the particulars of the multiprocessor 
system (hardware/software structure) and the 
application programs, as well. 

4) Characteristic workload: application programs 
in pure form (one type only) may occur in 
special purpose multiprocessor systems. For
general purpose systems, however, a mixture 
of all program structures is typical. We there- 
fore have to unify several models (combined 
model) and choose the model parameters accor- 
ding to a characteristic workload. 


In the following sections we outline steps
one, two and three. Although we have not yet
investigated the combined model, the analysis seems
to be straightforward. From our point of view it
seems to be more difficult to find a characteris- 
tic workload and to choose the model parameters 
accordingly (cf. final section). 


3.2 Classification of application programs 


Following the work of Adams [1] and others, we 
describe a program by means of a directed graph, 
the nodes representing subtasks (well defined 
functions or sets of functions), the edges showing 
interdependencies and representing data buffers 
(unlimited FIFO-queues). Nodes (subtasks) are per- 
formed if and only if each input edge to this node 
contains at least one datum. 
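
The firing rule just described can be illustrated with a minimal sketch. The class and method names below (`DataflowGraph`, `can_fire`, `fire`) are hypothetical and chosen for illustration only; the sketch assumes that every edge is an unbounded FIFO queue and that a node consumes one datum from each input edge when it is performed.

```python
from collections import deque


class DataflowGraph:
    """Directed graph: nodes are subtasks, edges are unbounded FIFO buffers."""

    def __init__(self):
        self.inputs = {}   # node name -> list of incoming edge queues
        self.outputs = {}  # node name -> list of outgoing edge queues

    def add_node(self, name):
        self.inputs.setdefault(name, [])
        self.outputs.setdefault(name, [])

    def add_edge(self, src, dst):
        """Create a FIFO edge from src to dst and return its queue."""
        self.add_node(src)
        self.add_node(dst)
        queue = deque()
        self.outputs[src].append(queue)
        self.inputs[dst].append(queue)
        return queue

    def can_fire(self, node):
        """A node may be performed iff each input edge holds at least one datum."""
        queues = self.inputs[node]
        return bool(queues) and all(len(q) > 0 for q in queues)

    def fire(self, node, func):
        """Consume one datum per input edge, apply func, emit on all output edges."""
        if not self.can_fire(node):
            raise RuntimeError(f"node {node!r} is not enabled")
        args = [q.popleft() for q in self.inputs[node]]
        result = func(*args)
        for q in self.outputs[node]:
            q.append(result)
        return result


# Two subtasks feed one joining subtask; the join is enabled only
# after both input edges contain a datum.
graph = DataflowGraph()
edge_a = graph.add_edge("s1", "join")
edge_b = graph.add_edge("s2", "join")
edge_a.append(3)
enabled_with_one_input = graph.can_fire("join")
edge_b.append(4)
joined = graph.fire("join", lambda x, y: x + y)
```

Treating a node without input edges as never enabled is a simplification of this sketch; the text does not say how source nodes are started.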


• Type-1-program structure


The type-1-program consists of a loop which
may be passed several times, cf. figure 3. Within 
that loop n independent subtasks can be distin- 
guished (there may exist some pre- and postpro- 
cessing). A new loop-cycle may be started if and 
only if all n independent subtasks are completed. 


Problems are often of this type: algorithms 
for the solution of linear-algebraic or partial 
differential equation systems, optimization proce- 
dures, simulations including subruns for the pur- 
pose of estimating confidence intervals, problems 
of picture processing, etc. etc. 


Validation example: in order to validate our 
performance modeling and evaluation technique (cf. 
also section 4) we investigated carefully the 
travelling salesman problem, a special optimization 
procedure: 


- A salesman has to visit each of s given cities
once and only once. What tour should he choose 
in order to minimize the total tour cost? 

- We can solve this problem by so-called 3-optimal
tours [14]: Starting from a randomly chosen initial
tour the algorithm tries to get an improvement by
replacing three links by another combination of three
links. This basic operation will be done until
no more improvement can be achieved. Because a
3-optimal tour is not necessarily optimal, we produce
several 3-optimal tours and take the best tour as
solution of the problem.


Fig. 3: Type-1-program structure


Fig. 4: Type-2-program structure


Obviously, this algorithm may be implemented
as a type-1-program structure: first we may generate
n (randomly chosen) initial tours (subtask $s_0$).
Starting with these initial tours we search in
parallel and independently for 3-optimal tours
(subtask $s_i$, $1 \le i \le n$). Having finished all parallel
subtasks we compare these 3-optimal tours and
select the best one (subtask $s_{n+1}$).

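
One loop-cycle of this type-1 structure can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, a thread pool stands in for the parallel processors, and the local search uses simple segment reversal (a 2-optimal style exchange) as a shorter stand-in for the 3-optimal exchange of [14].

```python
import math
import random
from concurrent.futures import ThreadPoolExecutor


def tour_cost(tour, points):
    """Total length of the closed tour visiting points in the given order."""
    return sum(
        math.dist(points[tour[k]], points[tour[(k + 1) % len(tour)]])
        for k in range(len(tour))
    )


def local_search(tour, points):
    """Improve a tour by segment reversal until no exchange helps.

    Stand-in for the 3-optimal search of the text (assumption of this sketch).
    """
    tour = tour[:]
    improved = True
    while improved:
        improved = False
        for i in range(1, len(tour) - 1):
            for j in range(i + 1, len(tour)):
                candidate = tour[:i] + tour[i:j + 1][::-1] + tour[j + 1:]
                if tour_cost(candidate, points) < tour_cost(tour, points) - 1e-12:
                    tour, improved = candidate, True
    return tour


def type1_cycle(points, n, seed=0):
    """One loop-cycle: s_0 generates, s_1..s_n search in parallel, s_{n+1} selects."""
    rng = random.Random(seed)
    # Subtask s_0: generate n randomly chosen initial tours.
    initial = [rng.sample(range(len(points)), len(points)) for _ in range(n)]
    # Subtasks s_1..s_n: independent searches; leaving the pool context is
    # the global synchronization point of the type-1 structure.
    with ThreadPoolExecutor(max_workers=n) as pool:
        results = list(pool.map(lambda t: local_search(t, points), initial))
    # Subtask s_{n+1}: compare the locally optimal tours and select the best.
    return min(results, key=lambda t: tour_cost(t, points))


square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
best_tour = type1_cycle(square, n=3)
```

The selection step can only start after every search has returned, which mirrors the condition that a new loop-cycle may be started if and only if all n independent subtasks are completed.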
• Type-2-program structure


Here, the program also consists of a loop. 
However, the n subtasks influence each other in 
some way (figure 4 shows one possibility) rather 
than being completely independent. Compared to 
type-1-program structure there are not only global
but also local synchronizations necessary. 


• Type-3-program structure


For many application programs the degree of 
parallelism varies rather than being constant. Such 
program structures are classified as type-3-programs, 
an example of which is shown in fig. 5. 


Fig. 5: Type-3-program structure with varying parallelism [13]


• Type-4-program structure


These program structures consist of completely
independent tasks $t_1, \ldots, t_n$ (cf. fig. 6a). Remark:
in addition there may be some preprocessing $t_0$ and
a feedback loop (fig. 6b). However, no synchronization
is necessary between $t_1, \ldots, t_n$ since they
are completely independent.


It should be noted that the travelling salesman
problem may also be implemented as a type-4-
program. Then, however, subtasks $t_1, \ldots, t_n$
correspond to the parallel but independent search
for one 3-optimal tour, i.e. each subtask tries
to improve a rotation of the best tour known at
that moment.
Fig. 6: Type-4-program structures


3.3 Implementation and modeling for a two-level hierarchy


• Type-1-program structure


Type-1-programs may be implemented very efficiently
on a hierarchically organized multiprocessor
system with two levels (cf. figure 7):


Fig. 7: Hierarchically organized multiprocessor system with two levels
(B-processor at level 1, A-processors at level 2)


- At first the B-processor starts the program. 

- Then the B-processor initiates the execution of 
n independent subtasks by the A-processors. 

- Having completed its subtask, each A-processor 
has to inform the B-processor. 

- Postprocessing and preparation of a new loop- 
cycle by the B-processor is only possible when 
all subtasks are completed (i.e. synchronization 
is necessary). 


The corresponding timing diagram is shown
in figure 8. Again, it is worthwhile to
interpret it for the travelling salesman problem.


A queuing model which allows us to describe 
and analyze the traffic flow including the above 
synchronization problem is shown in figure 9 
(synchronization is shown symbolically by a hori- 
zontal bracket). 
Fig. 8: Execution of a type-1-program on
a two-level hierarchy


Fig. 9: Queuing model for type-1-programs
and monoprogramming


Monoprogramming 


For reasons of simplicity and transparency 
we first assume monoprogramming for both B- and 
A-processors: 


- Newly arriving demands (source programs) are 
buffered in the input queue. 

- If the "inner" system is empty the first demand 
is processed by the B-processor. 

- The B-processor generates n independent sub-de- 
mands and distributes them simultaneously among 
all n A-servers (more sophisticated transfers, 
cf. section 3.5). 

- After completion, each sub-demand is buffered
in the corresponding input queue of the B-server.
If all n sub-demands are buffered, they are removed
simultaneously (symbolized by the horizontal bracket)
from the n parallel input queues and processed in one
step.

- After completion there are two possibilities:
the (complete) demand leaves the system or n new
sub-demands are generated simultaneously and a
new loop-cycle is started.


Multiprogramming 


If we allow multiprogramming for both B- and
A-processors several programs may be interleaved.
In principle, the corresponding queuing model is
similar to that of monoprogramming. However, queues
may build up in front of the A-processors, too.


Mixed multi- and monoprogramming 


Multiprogramming allows us to increase system
throughput. However, for reasons of simplicity and
transparency of the operating system there is a
trend to introduce monoprogramming again. For many
applications a mixture of both seems to be an efficient
solution: multiprogramming for the B-processor
and monoprogramming for the A-processors. Figure 10
shows the corresponding queuing model and
is rather self-explanatory:

Assume a number m of independent demands
($t_1, \ldots, t_i, \ldots, t_m$) to be served sequentially (!)
on the B-server. After completion each task $t_i$
($i = 1, \ldots, m$) generates $n_i$ independent subtasks to
be processed by the reserved A-processors $A_{i1}$ to
$A_{in_i}$. Task $t_i$ may be resumed if and only if all
$n_i$ subtasks have been completed by the A-processors.
If the B-server is busy, complete demands
wait in front of the server and are served in the
order of arrival (FIFO).


Note, synchronization is only necessary 
between sub-demands belonging together, an impor- 
tant fact for analysis. 


Other program structures 


We illustrated the process of modeling by
means of type-1-program structures. It is
straightforward to develop the queuing models for the
other program types and for a mixture of several
program types, too. The variety of these queuing
models is presented in [8]; two more examples are
shown in fig. 11. We now turn to three-
and multi-level hierarchies.


3.4 Performance modeling for three- and multi-level hierarchies


As before we do have to analyze the struc- 
ture of the application programs to be run on the 
hierarchy and we do have to decide how to implement 
the algorithms (a detailed description how to im- 
plement the travelling salesman problem may be 
found in [9, 10], numerical results are shown in 
section 4.3). 


In modeling the traffic flow we have to 
consider the following most important point: 


- It is rather unwise and probably unsuccessful to
describe and analyze the complicated flow of 
information and all interdependencies by a single 
global queuing model. 


Fig. 10: Queuing model for mixed mode
An efficient and often successfully applied
technique (often referred to as hierarchical modeling
technique [7, 11]) is nothing else but a
systematically structured top-down modeling and a
bottom-up evaluation.


- We decompose the total system into hierarchically
structured subsystems of manageable size and
develop the corresponding models: a macro level
model, several medium level models, etc.

- We analyze these models individually either by
simulation or mathematical tools.

- Starting at the lowest modeling level we determine
"local" performance values, embed them into
the next higher modeling level, etc., up to
the highest level, the macro level model.


Figure 11 illustrates this technique, applied 
to a three-level hierarchically organized multi- 
processor system (numerical results and their 
validation are discussed in section 4.3): 


First, we investigate the interactions between
the top processor C and all subsystems SS1,
..., SSn and develop the macro-level model. Obviously,
the duration of service by these subsystems is
not yet known. We therefore have to go "one level
deeper", investigate the interactions within each
subsystem (one B- and several A-processors) and
develop the corresponding models (which may be of
the same or different type), etc. etc.*


In evaluating this hierarchy of models we 
have to proceed in the other direction: we start 
with the lowest level models, determine "local" 
performance values which reflect the duration of 
service at this level, and embed these results in 
the next higher level of models, etc. etc. Finally 
we reach the top level and obtain global perfor- 
mance values such as throughput and response-time 
of the total system. 
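
The bottom-up evaluation can be sketched numerically. The sketch below rests on assumptions that are not taken from the paper: every level is treated as a single type-1 cycle (one service phase followed by a synchronization over independent exponential servers), and each subsystem is embedded in the macro level as an exponential server whose rate is the reciprocal of its local mean cycle time. The function names are hypothetical. The mean of the maximum of independent exponential times is computed with the standard inclusion-exclusion formula.

```python
from itertools import combinations


def expected_max_exponential(rates):
    """Mean of the maximum of independent exponential times with the given rates."""
    total = 0.0
    for size in range(1, len(rates) + 1):
        sign = 1.0 if size % 2 == 1 else -1.0
        for subset in combinations(rates, size):
            total += sign / sum(subset)
    return total


def subsystem_mean_cycle(mu_b, a_rates):
    """Lowest level: B-service (rate mu_b) plus synchronization over the A-servers."""
    return 1.0 / mu_b + expected_max_exponential(a_rates)


def three_level_mean_cycle(mu_c, subsystems):
    """Macro level: embed each subsystem as a server with rate 1 / (local mean cycle).

    subsystems is a list of (mu_b, a_rates) pairs, one pair per subsystem.
    """
    embedded_rates = [
        1.0 / subsystem_mean_cycle(mu_b, a_rates) for mu_b, a_rates in subsystems
    ]
    return 1.0 / mu_c + expected_max_exponential(embedded_rates)


# Two subsystems below a top processor C with service rate 1.
example = three_level_mean_cycle(1.0, [(2.0, [1.0, 1.0]), (2.0, [1.0, 2.0, 4.0])])
```

Replacing a subsystem by an exponential server with the same mean discards the shape of the local distribution, so the result is an approximation only.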


3.5 Refinement of models 


Up to now we have neglected some important 
features of many real computer systems in our 
examples: 


- the initialization of subprocesses often takes
a considerable amount of time

- application programs and data have to be transferred
between foreground and background memories

- synchronization between subtasks causes interrupts
and overhead for the top processor

- etc.


It is possible to include these phenomena in our
modeling technique, too, as demonstrated next by
one example.


• Initiation of subprocesses


Up to now we have assumed negligible time
intervals for the initiation of subprocesses. In
practice, however, there may be a considerable
amount of time necessary for the preparation and
transfer of control information, user programs
and data.


Fig. 11: Modeling of a three-level hierarchically organized computer
system by one macro-level model (SS0) and n medium level
models (SS1,...,SSn) (SS1 corresponds to a type-4-program
structure while SSn allows us to model type-3-program
structures).


* Structured submodels for each individual processor
are well known from literature [11].


From our experience with the EGPA-project 
[17], these details may be captured by the follo- 
wing operations (cf. timing diagram, figure 12): 


- After some common preprocessing the B-processor
initiates the first subprocess: code and user
data are loaded to the memory of processor $A_1$;
additionally there may be an individual preprocessing
for this subprocess necessary, handshaking, etc.

- Processor $A_1$ may now start to process its subtask
and the above procedure is repeated for processors
$A_2$, $A_3$ to $A_n$.


The corresponding queuing model is the same 
model as in the case without individual phases for 
starting. However, we now interpret the behaviour 
of the B-server in another way: 


In the simple model, the B-server removes 
simultaneously from all input queues one sub- 
demand and processes them in one step. Now, the 
B-processor also removes from each queue one sub- 
demand, but it serves them sequentially. And there 
is some overlap between the B-processor and the 
A-processor to be started next. 
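
The effect of sequential starting phases on one loop-cycle can be sketched as follows. The function names and the deterministic example values are hypothetical; the sketch assumes that the B-processor starts the A-processors one after another and that each A-processor begins its subtask as soon as its own starting phase ends.

```python
def cycle_completion_time(start_phases, service_times):
    """Time until the last A-processor finishes when starts are sequential.

    start_phases[j] is the time the B-processor needs to start A-processor j;
    service_times[j] is the processing time of that A-processor's subtask.
    """
    clock = 0.0
    finish_times = []
    for start, service in zip(start_phases, service_times):
        clock += start                        # B-processor busy with this start
        finish_times.append(clock + service)  # A-processor runs from then on
    return max(finish_times)


def simultaneous_completion_time(service_times):
    """Simple model: all sub-demands start at the same instant."""
    return max(service_times)


with_phases = cycle_completion_time([1.0, 1.0, 1.0], [5.0, 1.0, 1.0])
without_phases = simultaneous_completion_time([5.0, 1.0, 1.0])
```

While the B-processor is starting the next A-processor, the processors already started are working, which is the overlap mentioned in the text.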




Interarrival and service times are interpre- 
ted as random variables with a given distribution 
function. The usefulness and validity of such an 
approach has often been proven (cf. [11]) and 
efficient methods are available to determine type 
and parameters of distribution functions [16, 18]. 


All models for two-level hierarchies presen- 
ted above have been investigated under various 
assumptions [10, 19, 20]. 


- First we analyzed the models under Markovian
assumptions, i.e. for reasons of simplicity we
assumed exponentially distributed service times
for both B- and A-servers (the mean value for
different A-servers may vary).

- Since service times are most often non-exponentially
distributed we then relaxed these assumptions
and analyzed the models for Erlangian,
hyperexponential and even generally distributed
service times.


Fig. 12: Timing diagram with phases for starting an A-processor


4. Analysis of Some Models and Numerical Results 


4.1 Remarks on the probabilistic analysis 


The flow of information within computer 
systems is determined by 


- the transportation of data and control informa- 
tion between subsystems 

- the processing of data by processors and control 
units 

- the blocking and waiting at various locations 
within the computer system. 


The sequence of events of such transporta- 
tion-, waiting- and processing-times is at least 
partially deterministic although the interdepen- 
dencies are rather complex. From the point of view 
of a subsystem, however, the sequence of events 
seems to be random. In other words, the sequence 
of events may be described by means of stochastic 
processes: 


In analyzing the performance we usually assumed
stationarity and then determined state probabilities
and characteristic performance values such as

- system throughput 

- server utilization, individually for each server

- mean numbers of A-servers working simultaneously 

- mean numbers of demands and/or sub-demands wai- 
ting in front of servers (queue length) 

- mean synchronization time, i.e. mean time between 
the moment when all related sub-demands start 
and the last one is completed 

- mean cycle time for complete demands, the sum of 
synchronization time and processing time at the 
B-server 

- distribution functions for both the synchroniza- 
tion and cycle time. 
4.2 The model for type-1-program structures and
mixed multi- and monoprogramming


Consider a model structure according to figure 10,
whose service discipline is
described in section 3.3. For reasons of clarity
service times for both B- and A-servers are assumed
to be exponentially distributed with


$\mu_i$ : service rate for the B-server
$\lambda_{ij}$ : service rate for the $A_{ij}$-server,
$i \in \{1,2,\ldots,m\}$, $j \in \{1,2,\ldots,n_i\}$

m : number of complete demands $D_i$ (competing
for the B-server), i.e. degree of multiprogramming

$n_i$ : number of parallel sub-demands belonging
to demand $D_i$.

Analysis is performed under equilibrium con- 
ditions, i.e. stationarity is assumed. 


• Decomposition: Recall, again, that synchronization
is only necessary between sub-demands $S_{ij}$,
$j \in \{1,2,\ldots,n_i\}$, belonging together. So, if we are able to
analyze all individual synchronization processes,
the overall behaviour is described exactly by the
following model, known from the model for type-4-
programs (cf. figure 11, model for SS1):


B-server: service times exponentially distributed
with rates $\mu_i$ as in the complete model.
A-server: service times generally distributed with
different service rates $\lambda'_i$ according to
the d.f. of the synchronization times
$F_i(t)$ and its mean $E_i[T_i]$
(definition cf. 4.1). No queues in front
of the A-servers.


We therefore attack first the (individual) 
synchronization problem also being the solution 
for the monoprogramming model shown in figure 9. 
Then, in a second step, we briefly outline the 
solution for the model for type-4-programs, a 
generalization of the so-called repairman-model. 


• Synchronization: Let $D_i$ be the complete demand
the synchronization time $T_i$ of which has to be
determined. Processing of all sub-demands $S_{ij}$,
$j \in \{1,2,\ldots,n_i\}$, starts at the same instant.
Suppose all processing times are exponentially
distributed with uniform service rates
$\lambda_{i1} = \lambda_{i2} = \ldots = \lambda_i$
(solution for different rates, cf. [55).

Now, $T_i$ is the maximum of $n_i$ service times
$T_{ij}$, $j \in \{1,2,\ldots,n_i\}$ (cf. figure 13), and therefore
the distribution function $F_i(t)$ of the
synchronization time is the product of the distribution
functions $F_{ij}(t)$ of the service times $T_{ij}$
($1 \le j \le n_i$). With

$$F_{ij}(t) = 1 - e^{-\lambda_i t}, \qquad j = 1, 2, \ldots, n_i,$$

we obtain

$$F_i(t) = P(T_i \le t) = \prod_{j=1}^{n_i} F_{ij}(t) = \left(1 - e^{-\lambda_i t}\right)^{n_i}, \qquad i \in \{1,2,\ldots,m\},$$

with mean and variance

$$E_i[T_i] = \frac{1}{\lambda_i} \sum_{k=1}^{n_i} \frac{1}{k}, \qquad \mathrm{VAR}_i[T_i] = \frac{1}{\lambda_i^2} \sum_{k=1}^{n_i} \frac{1}{k^2}.$$

Fig. 13: The synchronization time $T_i$


• Overall behaviour: Since the synchronization
problem is solved now the overall behaviour is
completely determined by the model for type-4-
programs, presented in fig. 11, SS1 with no queuing
in front of the A-servers and where the service
rate of these m A-servers is

$$\lambda'_i = \frac{1}{E_i[T_i]}, \qquad i \in \{1,2,\ldots,m\}.$$

We first analyzed this model under Markovian
assumptions: system states were described by
a (m + 1)-dimensional vector, stationarity was
assumed and the explicit solution derived for the
state probabilities, a generalization of the well
known M/M/1/m-solution.
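
For orientation, the classical special case of this model can be computed directly. The sketch below covers only the homogeneous finite-source (repairman) model, in which all m A-servers share one rate and the B-server has one exponential rate; this is an assumption of the sketch, not the generalized solution referred to in the text, and the function names are hypothetical. State k counts the demands waiting for or being served by the B-server.

```python
from math import factorial


def repairman_state_probabilities(m, lam, mu):
    """Stationary probabilities p_0..p_m of the homogeneous repairman model.

    m   : number of demands circulating (degree of multiprogramming)
    lam : common rate at which a demand finishes its A-phase (assumed uniform)
    mu  : service rate of the B-server
    """
    rho = lam / mu
    weights = [factorial(m) / factorial(m - k) * rho ** k for k in range(m + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def b_server_utilization(m, lam, mu):
    """Probability that the B-server is busy."""
    return 1.0 - repairman_state_probabilities(m, lam, mu)[0]


def throughput(m, lam, mu):
    """Rate at which B-service phases are completed."""
    return mu * b_server_utilization(m, lam, mu)


probabilities = repairman_state_probabilities(2, 1.0, 1.0)
utilization = b_server_utilization(2, 1.0, 1.0)
```

With m = 2 and equal rates the weights are 1, 2 and 2, so the B-server is idle one fifth of the time.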


Secondly, we proved that the solution is also
valid in the case of general service time distributions
for the A-servers. The detailed analysis
may be found in Erene some characteristic perfor- 
mance values are: 


Utilization of the B-processor YR 
4.3 Numerical results for two-level hierarchies and
a three-level hierarchy


Figure 14 shows the mean cycle time t_ for 
complete demands on a two-level hierarchy as a 
function of: 


1) the program (and therefore model) structure: we
may have one demand with four sub-demands to be
synchronized (upper curve), we may have two independent
demands with two sub-demands each
(middle), or we may have four independent demands
(lower curve).

2) the mean service time $1/\mu$ of the B-server

Furthermore, it is assumed that the service
rate $\lambda$ is equal to one for all sub-demands. In
order to compare the results for the same load
per cycle of the B-server we introduced different
scales. The diagram shows clearly how the type of
program influences the cycle time.




Figure 15 shows a three-level multiproces- 
sor computer system and the flow of information 
if we run the travelling salesman problem on such 
a configuration (cf. also section 3.4). 


We developed three approximate procedures 
by decomposing the system in several subsystems. 
Some results are shown in figure 16 and compared 
with a step-by-step simulation of the complete 
3-optimal-tour-algorithm. In other words: we simu- 
late the algorithm as it would be performed on a 
real three-level multiprocessor system (since the 
algorithm starts from a randomly chosen initial 
tour, the execution time varies even for the same 
number n of cities: we therefore performed the 
algorithm ten times for each problem). 


Comparisons show that the very simple procedure
1 (expon. assumptions) tends to overestimate
the execution time. Procedure 2 (Erlangian
d.f. with constant variance) yields, for a small
number of cities, results which are somewhat too
optimistic. Finally, the most sophisticated procedure
3 (Erlangian d.f., variance properly adjusted)
always yields accurate results.


5. Summary and outlook


The intention of this paper is twofold:


- First, we tried to overview the field of traffic 
theory, related problems, its methodology and 
its importance: performance modeling and eva- 
luation is needed from the initial conception 
of a system architecture to the daily operation 
of a computer system! 

- Secondly, we showed how to model and evaluate 
the performance of hierarchically organized 
multiprocessor computer systems. Most important: 
the flow of information depends on the hardware/ 
software structure and the structure of the appli- 
cation programs, as well! 
Fig. 14: Examples for the mean cycle time Ey on a two-level hierarchy
Fig. 15: Three-level multiprocessor computer structure (levels 1, 2 and 3)


Fig. 16: Execution time t for the travelling salesman
algorithm on a three-level hierarchical
multiprocessor computer system. Comparison between
results from a step-by-step simulation
of the algorithm (simulations x, mean value @)
and three approximate solutions (proc. 1 o-o,
proc. 2 A-A, proc. 3 C-t1). Results are
shown for n = 10, 16, 33 and 42 cities (cf.
text).
Work is going on. At the moment our research
goes in two directions:


- Besides general distribution functions for service
times, our models still contain exponential
assumptions. We are trying to extend our
results by introducing more general distributions.

- Secondly, we try to refine our models: transfer
times, individual initiation of subprocesses
and priorities are some of the most important
examples.


From these remarks you may see that there
are still many important and challenging problems
to solve in the area of performance modeling and
evaluation for multiprocessor computer systems.
Performance bounds for a certain class
of parallel processing


Norman L. Soong, Sperry Univac 


A major reason for the exploration of
parallel computation is to improve computing
system performance. A family of absolute
mathematical upper bounds for system performance
improvement is discovered and presented in this
paper. Of all the upper bounds, there is a least
upper bound, l.u.b. The mathematically derived
l.u.b. is then applied to a set of empirically
obtained performance improvement data. The fit is
more than casual. The result is also compared to
a similar one obtained at Stanford University.

Processes, processors, and tasks are the 
fundamental concepts of this paper. A process is 
a progression of computing machine operations. A 
processor is an entity capable of carrying out 
operations. Represented by a collection of 
processes, a task is mapped onto a collection of 
processors to be computed. Four basic assumptions 
are fundamental to this study: 

* MIMD organization, 

* identical individual processors, 

* inexhaustible supply of processors,

* unit operation execution time. 


The term speedup (Sp) is related to task
performance gain. Sp is defined as the ratio of
the time needed to compute the task on a
uni-processor system to that of a multi-processor
system. Let 'p' be the maximum number of processors
utilized to compute this task; then Sp=T1/Tp
is commonly adopted as the definition of
performance gain.

Let one unit of work be defined as the
accomplishment of one processor operating for one
unit of time, then the ith-work partition (Ai) can
be introduced as the percentage of work done by
i-processors for this task. It can be shown that

$$T_p = \sum_{i=1}^{p} A_i \, T_1 / i$$

is the expression for Tp. Substituting Tp into Sp,
the fundamental relation for Sp is shown as

$$S_p = 1 \Big/ \sum_{i=1}^{p} (A_i / i).$$

Theorem 1  $S_p \le \frac{1}{p} \left( p! \Big/ \prod_{i=1}^{p} A_i \right)^{1/p}$ where p is an
integer, and $A_i \neq 0$ for all i.

Theorem 2  Let $F(A_i) = \frac{1}{p} \left( p! \Big/ \prod_{i=1}^{p} A_i \right)^{1/p}$ for integer
p, then the optimal minimal function of
F(Ai) is such that all Ai's are equal
and subject to the constraint of
$\sum_{i=1}^{p} A_i = 1$.
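
The relation between the speedup and the two theorems can be checked numerically with a short sketch. The function names are hypothetical, and the bound is written as it follows from applying the arithmetic-geometric mean inequality to the terms Ai/i, which is an assumption about the intended form of Theorem 1. With all Ai equal to 1/p the bound reduces to the p-th root of p!.

```python
import math


def speedup(partition):
    """Sp = 1 / sum(A_i / i); partition[i-1] is the work fraction done by i processors."""
    return 1.0 / sum(a / i for i, a in enumerate(partition, start=1))


def theorem1_bound(partition):
    """(1/p) * (p! / prod(A_i)) ** (1/p); requires every A_i to be non-zero."""
    p = len(partition)
    return (math.factorial(p) / math.prod(partition)) ** (1.0 / p) / p


def least_upper_bound(p):
    """Value of the bound when all A_i are equal to 1/p."""
    return math.factorial(p) ** (1.0 / p)


equal_partition = [0.25, 0.25, 0.25, 0.25]
skewed_partition = [0.1, 0.2, 0.3, 0.4]
sp_equal = speedup(equal_partition)
sp_skewed = speedup(skewed_partition)
```

For the skewed partition all terms Ai/i are equal, so the speedup meets its own Theorem 1 bound exactly and lies above the equal-partition value; a single program can therefore exceed that value.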

The significance of this paper is the
discovery of a class of mathematical upper bounds,
theorem 1, and the identification of the l.u.b.
of this class, theorem 2. To verify the validity
of this l.u.b., it is applied to a set of
empirical data. This set of data is derived from
approximately 200 programs. According to their
applications, they are organized into seven
groups to illustrate their average speedup
characteristics. Six groups fall under the l.u.b.
and one is 5 percent above the l.u.b.

A similar effort has been carried out at
Stanford University. A definite inequality is
derived instead of the semi-definite inequality of
theorem 1, and this approach has led to a family
of not-so-sharp upper bounds. Two data points are
above the Stanford University bound by a
significant margin of twenty percent and sixty-one
percent respectively.

The mathematical derivation of the l.u.b.
consists of a formal application of the geometric
inequality to a speedup definition. If the l.u.b.
has any intrinsic meaning at all, then it must be
contained originally in the speedup definitions.
A study of programs in the time-processor product
space suggests that the parallelism extracted at
the statement level is the prime contributor to
performance improvement. Further significant
evidence is offered by the reasonable agreement
with the available data points. This evidence
suggests that the l.u.b. bounds the maximum
ensemble performance improvement extractable from
ordinary programs by exploring the statementwise
parallelism contained in a program. An optimizing
FORTRAN compiler is a typical example of this type
of automatic performance extractor.

Emphasized in the above claim is the
implication that the bound of Theorem 2 is an
ensemble bound. Both the Sp-definition and the
available data points are averaged ensemble
quantities. It is possible to have a single program,
or even a class of programs, running on
a machine whose performance can be improved beyond
the l.u.b. Yet the contrary is also true:
there exist programs and classes of programs
whose performance can be improved only a little.
It is the average of all those
situations that is the quantity called ensemble
speedup, or ensemble performance improvement.
This is the quantity that is bounded by the l.u.b.
The effectiveness of the l.u.b. is related to the
number of data points in the experiment. The fit
will improve with more data points.




Figure. The Ensemble Performance Bound 


A BLOCK-ORIENTED SPARSE EQUATION SOLVER 
FOR THE CRAY-1 


D. A. Calahan 
Department of Electrical and Computer Engineering 
University of Michigan 
Ann Arbor, MI. 48109 


Abstract 


This paper investigates the impact of a cur- 
rent vector processor architecture on the algo- 
rithms of a sparse direct equation solver. 

Timing models of a block-oriented solver imple- 
mented on the CRAY-1 allow an evaluation of the 
overhead of a general solver on such a processor. 
It is shown that a general solver can outperform 
even dedicated solvers written conventionally in 
Cray assembly language. 


Sparse Equation Solvers


General sparse equation solvers are compli- 
cated mathematical software packages which solve 
directly (contrast iteratively) simultaneous 
linear equations having an arbitrary off-pivot 
sparsity structure [1] - [2]. This structure is 
described either by a linked list [3] or a bit 
map [4]. Such solvers have been used for more 
than a decade as kernels in (1) the implicit 
solution of algebraic and ordinary differential 
equations associated with lumped physical systems 
such as electronic circuits, electrical power 
systems, and 3-D mechanisms, and (2) the solution 
of discretized partial differential equations 
(PDE's) such as arise in oil reservoir simulation. 

This paper examines the characteristics and 
applications of such solvers implemented on a 
memory-hierarchical vector processor, the CRAY-1.
Because the architecture is fixed, a "bottom-up" 
approach is used, wherein the performance of ap- 
propriate numerical kernels on the CRAY-1 is 
studied first. These kernels are shown to have 
a strong preference for block matrix structure. 
This single fact profoundly affects the develop- 
ment of the solver and its applications, dis- 
cussed in the remainder of the paper. 

Because this study is directed toward a
specific processor, it is well first to summarize
three general issues which the implementation of such
a code on any vector processor is likely to
involve.

(a) Algorithmically, vectors arise from 
either local matrix density (coupling of vari- 
ables) or from symmetries in globally decoupled 
parts of the problem structure. The former will 
be considered in this paper; the latter is dis- 
cussed in [5]. 

(b) The coding of such a solver will pro- 
foundly affect its efficiency; this coding in 
turn is intimately related to the data flow of 
the processor, in part due to the linked list or 
bit processing. This implies a need for coding in 
a low level language and a consequent high degree 
of machine dependency. The CRAY-1 was chosen for 
this study due to its superior scalar and short 
vector performance. Linked lists rather than
bit maps were adopted because of the CRAY-1's
limited bit controllable vector operations.

(c) The performance evaluation involves 
construction of a timing model so that the price 
paid for using a vector processor and a general 
equation-solving code can be evaluated by a poten- 
tial user. In this study, the general solver will 
be shown to execute faster than codes written for 
specific matrix structures and conventionally 
coded in CAL (Cray Assembly Language). 


Block Equation Solvers


Kernels of Sparse Solvers. The proposed general
sparse equation solver operates on linked list
descriptions of the matrix structure. Because
vector operations utilizing linked lists (termed
"gathering" and "scattering") commonly proceed at
1/5 to 1/10 the speed of linearly-indexed array 
computation, it is important for efficient vector 
operation that the list not point to a single 
non-zero element. Assuming significant local 
matrix coupling, the list should point to either 

(a) a dense segment of a row or column, or, 
more generally, 

(b) a dense block 
of the matrix. With sufficient numeric computa- 
tion resulting from this density, the index list 
processing can be overlapped or at least over- 
whelmed by the numeric computation. 

Consider the factorization of a matrix A into
upper and lower triangular forms U and L respectively.
The numeric kernel is commonly of the
form

$$D_i \leftarrow D_i - B_i C_i \qquad (1)$$

where $D_i$, $B_i$, and $C_i$ are submatrices from the
overall sparse systems matrix. The accumulation
of the column segments of (a) will be termed a
line-vector operation of the form

$$A_{i:j,\ell} \leftarrow A_{i:j,\ell} - L_{i:j,k} \, U_{k,\ell} \qquad (2)$$

where $A_{i:j,\ell}$ and $L_{i:j,k}$ are dense column segments
of the matrix A and the triangular factor L, extending
from row i to row j, and $U_{k,\ell}$ is an
element of U (Figure 1(a)). A block-vector operation
can be written (Figure 1(b)).
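
A scalar sketch of the line-vector operation may help to fix the notation. It assumes that the operation subtracts a multiple of a dense column segment of L from a dense column segment of A, with the multiplier taken from U; this reading of equation (2) and the function name are assumptions of the sketch. Matrices are plain nested lists indexed as `M[row][column]`, and rows i to j are taken inclusively.

```python
def line_vector_update(A, L, U, i, j, k, l):
    """Update column segment A[i..j][l] by subtracting L[i..j][k] * U[k][l].

    Returns the number of floating point operations performed
    (one multiply and one add per element of the segment).
    """
    multiplier = U[k][l]
    for row in range(i, j + 1):
        A[row][l] -= L[row][k] * multiplier
    return 2 * (j - i + 1)


A = [[10.0, 0.0], [20.0, 0.0], [30.0, 0.0]]
L = [[1.0], [2.0], [3.0]]
U = [[2.0, 0.0]]
flops = line_vector_update(A, L, U, 0, 2, 0, 0)
```

Each updated element needs two loads and one store for two floating point operations, which is the memory traffic the timing discussion refers to.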


Boe i << : - LL... U 
ct) 3s CL se eee Lj gkee? hit org 
(3) 


with similar subscript notation to indicate ini- 
tial and final row and column indices. 

An approximate timing model for the line- 
vector accumulation loop on the CRAY-l has been 
found by simulation to be 


MFLOPS = | 1 (4) 


se ec Pecy 


where $\bar{\ell}$ is the average length of a dense column
segment encountered during the accumulation.
This model, valid when all vectors are longer
than 14, achieves a maximum rate of 35.8 for $\bar{\ell}$ =
64, the maximum vector length on the CRAY-1.

The mathematical model for the vector block
accumulation of (3) is given in Figure 2 as a
function of block dimensions. The asymptotic
rate of this kernel is 151 MFLOPS for q = 64 (the
vector length) and p and r large.

The ratio of asymptotic execution rates
(151/35.8 = 4.2) can be traced directly to the
memory bandwidth required to implement the line-
vector accumulation of (2), where an element of
$A_{i:j,k}$ and $L_{i:j,l}$ must be loaded and $A_{i:j,k}$
stored for each add and multiply. With a single
memory path transferring one word in 12.5 nsec,
the low asymptotic rate of $2(80 \times 10^6)/3$ = 53.3
MFLOPS of (4) follows.
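
The bandwidth argument above can be checked with a few lines of arithmetic. The following is a minimal sketch, assuming only the figures quoted in the text (an 80 MHz word rate on a single memory path, and two loads plus one store for each add-multiply pair); the function and variable names are illustrative and do not come from the paper.

```python
CLOCK_HZ = 80e6  # one word per 12.5 ns on the single memory path

def memory_bound_mflops(flops_per_element, memory_ops_per_element, clock_hz=CLOCK_HZ):
    """Upper bound on MFLOPS when every element costs a fixed number of memory transfers."""
    elements_per_second = clock_hz / memory_ops_per_element
    return flops_per_element * elements_per_second / 1e6

# Line-vector accumulation: load A, load L, store A (3 transfers) per add and multiply (2 flops).
line_vector_bound = memory_bound_mflops(flops_per_element=2, memory_ops_per_element=3)
# Ratio of the asymptotic block rate to the best line-vector rate quoted in the text.
rate_ratio = 151 / 35.8

print(round(line_vector_bound, 1), round(rate_ratio, 1))
```

The block kernel escapes this bound because operands held in vector registers are reused across several multiply-adds, so fewer memory transfers are charged to each floating point operation.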


Block Solver Characteristics. The above prefer- 
ence for block processing is unfortunate in the 
sense that such blocking is not unique and little 
is known of "optimal" methods of blocking a 
general sparse matrix. The equation formulation
and the "fill" production [6] produce consider-
able overlapping of block structures, often
masking the original problem structure. An ex-
ample blocked structure is shown in Figure 3.
Thus, a block solver must either be given the
block structure or await the development of
blocking algorithms. This paper considers only
the development of the solver, in part because
general blocking methods will undoubtedly be
dependent on characteristics of the solver.

A general block-oriented equation solver for
a vector processor, although conceptually derived
from previous single-element processing codes,
promises several important new applications.

(a) Common matrix structures are naturally
blocked by local coupling and variables of nodes.
The speed of vector processors encourages such
coupling, especially to speed up the convergence
of global iterative methods. For example, block
relaxation methods are being considered to re-
place single-variable relaxation [7].

(b) A memory-hierarchical processor with a
limited vector length requires the partitioning 
of large dense systems to reduce data flow be- 
tween hierarchies--the source of difficulties in 
the line-vector accumulation above. Although 
specialized partitioned codes can be developed 
for each general class of dense matrix (full, 
band, profile, block tridiagonal), the coding in 
a low level language for a vector processor is 
sufficiently complicated that, at this writing, 
only partitioned full matrix codes have been 
developed for the CRAY-1 [8]. 

If one observes that partitioning is a sym- 
bolic rather than numeric process, it is reason- 
able to propose a solver which requires the user 
to construct only a set of block descriptors from 
the matrix structure. These descriptors would 
then guide the numeric solution. The descriptors
must be adequate to describe common block struc- 
tures and typical numeric storage schemes, but 
limited in number so that they do not seriously 
interfere with the numeric kernels, on which the 
processing speed of the solver depends. 


Algorithm Organization 


The algorithm can be divided into two con- 
ceptual and organizational levels: (1) the 
global level, where general blocking rules are 
established and a general block-oriented solution 
algorithm is presented, and (2) the local level 
where sub-block structure forces tradeoffs be- 
tween generality and speed. 


Global Blocking. It is proposed that the block- 
ing be performed on the LU map of the matrix 
since it is the structure of L and U rather than 
the structure of A which is important at each 
stage of the solution process. Further, it is 
proposed that the blocking be based on the size 
of diagonal blocks. In the general sparse case, 
these blocks are determined by scanning the 
matrix diagonal (in one direction) for the largest 
full, square blocks. This yields a unique diag- 
onal blocking. Row and column strips are defined 
throughout the matrix based on these diagonal 
block partitions. Off-diagonal blocks are then 
defined only within the intersection of row and 
column strips. Figure 3 shows an example of this 
blocking. 
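
The diagonal scan described above can be sketched as follows. This is an illustrative sketch only: it assumes the LU map is given as a square 0/1 array with a nonzero diagonal and scans from the top-left corner; the function name and the small test map are hypothetical.

```python
def diagonal_blocks(lu_map):
    """Scan the diagonal in one direction and return the sizes of the
    largest full (completely nonzero) square diagonal blocks."""
    n = len(lu_map)
    sizes = []
    start = 0
    while start < n:
        size = 1
        # Grow the block while the candidate border row and column are full.
        while start + size < n:
            end = start + size
            border_full = all(
                lu_map[end][c] and lu_map[c][end] for c in range(start, end + 1)
            )
            if not border_full:
                break
            size += 1
        sizes.append(size)
        start += size
    return sizes

lu_map = [
    [1, 1, 0, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
    [0, 1, 0, 1],
]
block_sizes = diagonal_blocks(lu_map)
# Strip boundaries follow from the running sum of the diagonal block sizes.
strip_starts = [sum(block_sizes[:k]) for k in range(len(block_sizes) + 1)]
```

Because each block is grown as far as possible before the next one starts, a scan in a fixed direction gives one blocking only, which is the uniqueness property noted in the text.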

With the blocks constrained to lie within 
such strips, it is possible to specify the block 
solution algorithm. In Figure 4, the factoriza- 
tion steps are given for a blocked matrix of the 
form 


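$$
\begin{bmatrix}
A_{11} & A_{12} & \cdots & A_{1b} \\
A_{21} & A_{22} & \cdots & A_{2b} \\
\vdots & \vdots & \ddots & \vdots \\
A_{b1} & A_{b2} & \cdots & A_{bb}
\end{bmatrix}
\begin{bmatrix} X_1 \\ X_2 \\ \vdots \\ X_b \end{bmatrix}
=
\begin{bmatrix} B_1 \\ B_2 \\ \vdots \\ B_b \end{bmatrix}
\qquad (5)
$$

where the $A_{ii}$ are square dense blocks and the
sparsity of the $A_{ij}$ ($i \neq j$) is as yet unconstrained.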


Local Blocking. Define diagonal-based column and
row strips (DBCS and DBRS, respectively) as those
portions of column and row strips extending from
each diagonal to the southern and eastern bound-
aries of the matrix, respectively. These strips
contain all of the A.. and A.. blocks of Figure 4;
it is their sparsities which are now of concern
in the general sparse case.

An examination of the substitution step
shows that any nonzero position in either of these
blocks will result in propagation or "raining"
of nonzero positions to the eastern or southern
boundary (respectively) of that block. This
property is observed in the L and U map of Figure
3. This effect (a) fixes the contour along at
least one of the sides of each block, and (b) re-
quires the maximum length of any block along its
DBCS or DBRS to be at its eastern or southern
boundary, respectively (Figure 5).

Beyond this restriction, an arbitrary sub-
structure of L and U is possible. To accommodate
this generality in a solver requires block con-
tour descriptors and, more importantly, slows the
multiplication kernel of Figure 4. A compromise
adopted for this study was to assume each sub- 
block extends across its DBRS or DBCS as shown in 
Figure 5; however, any number of blocks may lie 
within an intersection of a row and column strip. 
This assumption is valid for common types of 
block structures or when blocking dense matrices. 
However, when special equation ordering is used 
to avoid fill [9], nonzeros may have to be added 
to extend blocks across the DBRS and DBCS; these 
are shown as darkened areas in Figure 3. 


With the above considerations in mind, the 
following are proposed as descriptors of DBRS 
blocks. 

(b1) the row position of the (1,1) block
position;

(b2) the number of columns in the block;
the number of rows is fixed by the diagonal block
size;

(b3) the column strip number containing the
block; this descriptor can be determined in a
preprocessor step; it initiates the scan to
locate the required block in the multiplication step;

(b4) the address of the (1,1) block posi-
tion in the packed matrix numeric array; this
permits an arbitrary numeric block storage;

(b5) the address increment in the numeric
array between the (i,j) and the (i,j+1) block
position; this eliminates the need for contig-
uously-stored block columns.


These descriptors can completely describe the
sparsity of the DBRS, under the above assump-
tions. A similar set of descriptors is used for
the blocks in the DBCS with the terms "row" and
"column" interchanged, except (b5) which remains
the column address increment to represent the
column storage of all blocks.
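
As an illustration of how the five descriptors might be held and used, the sketch below packs them into a record and computes the numeric-array address of a block element. It is a sketch under stated assumptions: the field names are invented here, indices are 0-based, and consecutive rows within a block column are assumed to be one address apart (column storage), with (b5) supplying the step between columns.

```python
from dataclasses import dataclass

@dataclass
class BlockDescriptor:
    row_start: int      # (b1) row position of the block's (1,1) element
    n_cols: int         # (b2) number of columns in the block
    col_strip: int      # (b3) column strip number containing the block
    base_address: int   # (b4) address of the (1,1) element in the numeric array
    col_increment: int  # (b5) address step from element (i,j) to (i,j+1)

def element_address(desc, i, j):
    """Address of block element (i, j), 0-based, assuming column storage."""
    if not 0 <= j < desc.n_cols:
        raise IndexError("column index outside the block")
    return desc.base_address + i + j * desc.col_increment

# Hypothetical block: three columns, first element at address 100, columns 10 words apart.
example = BlockDescriptor(row_start=4, n_cols=3, col_strip=2,
                          base_address=100, col_increment=10)
```

Only (b4) and (b5) enter the address computation; (b1) to (b3) locate the block within the strip structure and do not depend on how the numbers are stored.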

These descriptors are stored in a list in 
the processing order of the blocks. From the 
multiplication step of Figure 4, this order be- 
gins at the first DBCS, then the first DBRS, then 
the second DBCS, etc., always beginning at the 
diagonal block. This is often termed Crout 
processing order. 


Storage Considerations 


Whereas the symbolic pointers (b1-b3) are
related to the order of block processing, the
numeric pointers (b4-b5) are intended to allow
an arbitrary block numeric storage. The result
is that common compressed storage schemes can be
accommodated; also, these pointers permit certain
efficiencies to be taken in the blocking.

(a) Reducing the number of blocks. The
restriction q ≤ 64 in the numeric kernel of Fig-
ure 2 requires that row strips not exceed 64 in
width. This is a universal restriction. Others
of the previously mentioned global blocking re-
strictions were imposed to maintain the integrity
of the numeric processing with arbitrary place-
ment of blocks in the numeric storage. However,
certain common storage schemes allow violation of
these rules in the interest of efficiency. For
example, a full matrix stored in column-ordered 
form in a numeric array could be partitioned as 
in Figure 6, with no column strip extending above 
the diagonal block. A single block multiply such 
as S*7T, if directed to be accumulated into P, 
would, in fact, be accumulated into blocks P-Q-R. 
One only has to insure that descriptor (b5)--the 
address increment between columns of a block--is 
the pseudo-row dimension of the numeric array 
representing the entire matrix. Of course, this 
freedom to violate a global blocking rule in- 
volves the risk of improper accumulation across 
several blocks or indeed into void areas of a 
sparse matrix. For example, no check is made 
whether the user has accounted for all the block 
fill in a sparse matrix in setting up the block 
descriptor list. 

(b) Compressed format. In Figure 7 (b)-(c), 
several standard formats of block tridiagonal 
matrices are given. Storage I is similar to that 
of banded matrices, where the diagonals of the 
matrix are stored as rows of a two dimensional 
compressed array. This storage is achieved sim- 
ply by reducing descriptor (b5) for all blocks 
to one less than the row dimension of the entire 
matrix array; a rectangular block contour then is 
skewed into a parallelogram as shown. Figure 7 
(d) shows that the numeric descriptors are dif- 
ferent for the two storage schemes, but the sym- 
bolic pointers remain the same. 
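
The skewing effect of reducing descriptor (b5) by one can be seen in a small address calculation. The sketch below is illustrative only: the array row dimension `LD`, the base address `BASE`, and the 3 x 3 block size are hypothetical values, and the numeric array is assumed to be stored column by column.

```python
LD = 6    # hypothetical row dimension of the two dimensional numeric array
BASE = 2  # hypothetical (b4); leaves room above for two superdiagonals

def element_address(base, i, j, col_increment):
    """Address of block element (i, j) given (b4) = base and (b5) = col_increment."""
    return base + i + j * col_increment

def storage_position(i, j, col_increment):
    """(row, column) of the numeric array cell that holds block element (i, j)."""
    addr = element_address(BASE, i, j, col_increment)
    return addr % LD, addr // LD

# (b5) = LD: the block keeps its rectangular contour in the array.
rect = {(i, j): storage_position(i, j, LD) for i in range(3) for j in range(3)}
# (b5) = LD - 1: every matrix diagonal (constant i - j) falls in a single array row.
skew = {(i, j): storage_position(i, j, LD - 1) for i in range(3) for j in range(3)}
```

With the reduced increment the address is BASE + (i - j) + j * LD, so the array row depends only on i - j. This is the parallelogram contour described for Storage I, and it is obtained without changing the symbolic descriptors.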


Performance Evaluation 


Figure 4 depicts the major loops of the 
factorization program and yields with Table 1, 
the corresponding timings of the overhead and 
kernel computations. These were developed using 
a timing simulator [10]. 

From these timings, it is possible to develop
expressions for the solution time, execution rate,
and the fraction of time--termed the efficiency
η--in the numeric kernels for any particular
matrix structure.

For example, consider solution of a block
tridiagonal matrix with the general solver. The
execution time and efficiency are determined from
Figure 4 to be, for $n_b$ blocks of size n x n,

$$T = 345 + T_f + (n_b - 1)(620 + T_f + 2T_s + T_m)$$

where $T_f$, $T_s$, and $T_m$ are defined in Table 1, and

$$\eta = \frac{1}{1 + \gamma}$$

where

$$\gamma = \frac{345 + (n_b - 1)(620)}{T_f + (n_b - 1)(T_f + 2T_s + T_m)}$$

For a large number of blocks, the execution rate
and the efficiency become simply

$$\text{MFLOPS} = \frac{80\,(28n^3 - 9n^2 + 5n)}{6\,(620 + T_f + 2T_s + T_m)}$$

$$\eta = \frac{1}{1 + 620/(T_f + 2T_s + T_m)} \qquad (6)$$

The efficiency of (6) is calculated in Table 
2, where reasonable efficiencies are shown for 
nh oes 
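
The efficiency expressions can be evaluated numerically as in the sketch below. It assumes the timing model has the form overhead = 345 + (n_b - 1)(620) and kernel time = T_f + (n_b - 1)(T_f + 2T_s + T_m), all in clock periods; the kernel timings passed in the example are hypothetical placeholders, since the entries of Table 1 are not reproduced here.

```python
def overhead_and_kernel_time(n_blocks, t_f, t_s, t_m):
    """Split the factorization time into list-processing overhead and kernel time."""
    overhead = 345 + (n_blocks - 1) * 620
    kernels = t_f + (n_blocks - 1) * (t_f + 2 * t_s + t_m)
    return overhead, kernels

def efficiency(n_blocks, t_f, t_s, t_m):
    """Fraction of the total time spent in the numeric kernels, 1 / (1 + gamma)."""
    overhead, kernels = overhead_and_kernel_time(n_blocks, t_f, t_s, t_m)
    return 1.0 / (1.0 + overhead / kernels)

def asymptotic_efficiency(t_f, t_s, t_m):
    """Limit of the efficiency for a large number of blocks."""
    return 1.0 / (1.0 + 620.0 / (t_f + 2 * t_s + t_m))

# Hypothetical kernel timings (clock periods), for illustration only.
T_F, T_S, T_M = 1000, 1500, 2000
small_system = efficiency(2, T_F, T_S, T_M)
large_system = efficiency(10**6, T_F, T_S, T_M)
limit = asymptotic_efficiency(T_F, T_S, T_M)
```

The fixed cost of 620 clock periods per block step is charged against the kernel time of that step, so the efficiency improves as the kernels grow with the block size n.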

An interesting aspect of the general solver 
is that significant effort can be justified in 
development of the numeric kernels. In [11], for 
example, it is shown that a 50% speedup can be
achieved in these kernels for small matrices 
(without penalty for large matrices) by avoiding 
vector chaining--through which the CRAY-1 nor- 
mally achieves concurrency--and utilizing the 
vector registers instead as a dual cache memory 
which seeks to keep the floating point pipelines 
continuously busy. One can then view the speedup 
of these unconventional assembly language kernels 
as an asset which compensates for the overhead of 
list processing, the price for generality. An 
"adjusted efficiency" can then be obtained by 
multiplying the efficiencies of Table 2 by the 
kernel speedups. The product is the speedup of 
the general code over a code written for a block 
tridiagonal matrix with conventional assembly 
language kernels. These speedups are shown in 
parentheses in Table 2, for block sizes of four 
and eight, and range from 1.07 to 1.3. 

From recent comparisons of Fortran and con-
ventional assembly language codes [12], it is
clear that speed comparisons of specialized
Fortran-coded solvers and the general solver
would result in an even greater advantage for the
general solver. Thus, a user faced with program-
ming a Fortran solver for a specific block struc-
ture would be well advised to consider use of
this general solver requiring only a block de-
scriptor input.


Conclusions 


The goal of developing an efficient general
equation solver for a memory-hierarchical vector
processor appeared initially to yield the negative
result that a block structure was necessary.
While challenging to algorithmists traditionally
interested in problems with random sparsity, this
restriction risks limiting application to a relatively
small class of problems.

However, the application of a general block-
oriented solver to regularly-blocked matrices
was found appealing due to (1) the difficulty of 
specialized coding in a low level language and 
(2) the necessity for blocking large dense 
matrices in a memory-hierarchical environment. 
Indeed, it was shown that in at least some cases 
a general solver offered a speedup over a conven- 
tionally-coded specialized solver. 

Extensions of the solver are planned to
process band and profile matrices by developing a
small library of specialized numeric kernels
within the general solver. The solution effi-
ciency of finite element and finite difference
problems also needs study, although this will have
to await a preprocessor which automatically gen-
erates a blocked LU map from the original matrix
structure.
Table 1. [Execution rate (MFLOPS)]/[execution time (kiloclocks)]
of three numeric kernels.

Table 2. Estimated performance of general block sparse solver
on the CRAY-1 for a large system of block tridiagonal equations.

Figure 2. Execution rate of CRAY-1 for multiplication and accumulation
of two qxp and pxr matrices. q is limited to 64 by vector length
restriction. r = 10 unless noted otherwise.

Figure 3. Block structure of dissected finite element matrix;
boundary conditions account for some irregularities
in block structure. Blackened regions indicate where
non-zeros were added to conform to blocking rules.

Figure 4. Algorithm and timing for factorization portion of sparse
solver code. (Panels: Algorithm, Flow Chart.)

Figure 5. Local blocking assumptions. (Legend: original non-zero
positions of L and U; non-zero positions added to fulfill
blocking assumptions.)

Figure 6. Simplified blocking of full matrix.

Figure 7. Block tridiagonal matrix and two storage descriptions.
Circled numbers refer to numeric array addresses. (Panels:
(a) Matrix block structure, all blocks are 4 x 4; (c) Storage II;
(d) Descriptors.)
Abstract The continuing revolution in the fabrication of
integrated circuits has brought us to the point where soon
one million devices will be placed on a single chip. However,
this miracle of microminiaturization has not been followed
with revolutionary new computer designs. In this paper we
discuss how the potential of VLSI can be exploited in one
particular domain, the processing of matrix computations.
There are two major goals of any VLSI design which must be
met for such a chip to be profitable. These are regularity of
structure in the circuit layout and maximum overlap of
computation during processing. In this paper we show how a
single set of locally connected processors can be used to
achieve a maximum overlap of computation for all of the
following problems: (i) a matrix times a vector, (ii) LU
decomposition, and (iii) back substitution. Also we lay the
groundwork for analyzing the time complexity of parallel
algorithms executed by VLSI circuitry. Finally we introduce
some high level programming language notation for
expressing the execution of algorithms on multiprocessors.


Key Words: VLSI, large scale integrated circuits, matrix 
operations, parallel algorithms. 


This research has been supported by the National Science 
Foundation under grant number MCS73- 01/1793 


We begin by considering the problem of multiplying a
matrix times a vector. We assume the matrix is an n x n matrix with
elements $a_{ij}$ and a vector $x = (x_1, \ldots, x_n)$. We imagine a set of
processors which contain three registers: $R_A$, $R_B$, and $R_C$.
Each processor has three inputs and three outputs, one each
to each register. As this circuit is to be placed on a single
chip the number of external pins is limited and for this circuit
we assume only two input pins and three output pins. A
partial view of this circuit is drawn below. One sees two
rows of n processors each. Arrows indicate register
connections between processors. The fact that all the
connections are local is extremely important for VLSI design,
see [Mead78]. Connections for one of the registers are not
needed for this problem so they are not drawn.


Figure 1. Circuit for computing Ax = b. 


Now, what are the operations that we require of the
registers in these processors? All of the registers can accept
and send data over their register connection lines. In
addition each register can transfer its data to another
register in the same processor. Finally each register can
perform the operation $R_C \leftarrow R_C + R_A * R_B$ (and similarly for $R_A$
and $R_B$), plus the operation $R_C \leftarrow R_A/R_B$ (or $R_B \leftarrow R_A/R_C$ or
$R_A \leftarrow R_B/R_C$).


We assume that all of the operations take the same 
amount of time, which we call one time unit. We also assume 
that all of the processors execute some instruction at every 
time unit, though that instruction may be a “do-nothing" 
instruction. The algorithms presented here are assumed to 
reside in the processors so no broadcasting is necessary. 
Note that the algorithms given here are not systolic in the 
sense of Kung and Leiserson, [Kung79], but they definitely 
have been motivated by that paper. The algorithm for 
computing Ax now follows. 




zero out all registers 
forie¢ 1 tondo 
in parallel do //all statements in the scope of an in 
parallel// 
read x; into RL (Po is l<ksn. //statement are 
executed concurrently// 
read column i into RA(P Ks l<ksn. 
Re(Po 4) - RE (Po 4) lskéen. 
end 
Ro(P ay) t Re(Po 4) lL<sk<n. // This operation is called 
Shift-down // 
R, «R, + Re * R. for all processors in row 1. // 
multiply-and-add // 
repeat 
Ro (Pay) € Ra(P 1 ks ls<ké<n. 
output b,,...b, from Rp (Py // move-right-row1! n times 
fl 


Algorithm 1. Computing Ax = b. 
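
The arithmetic carried out by Algorithm 1 can be sketched sequentially as below. This is a simplified illustration, not a transcription of the listing: it keeps the column-by-column multiply-and-add pattern but omits the register shifts and the pin-limited input and output, and the function name is invented for the example.

```python
def matvec_by_columns(a, x):
    """Form b = A x by accumulating one column of A per pass."""
    n = len(x)
    b = [0] * n  # "zero out all registers"
    for i in range(n):
        # Column i of A and the value x_i are what one pass of the loop reads in.
        column = [a[k][i] for k in range(n)]
        # Multiply-and-add: on the chip all n processors of row one do this at once.
        b = [b[k] + column[k] * x[i] for k in range(n)]
    return b

result = matvec_by_columns([[1, 2], [3, 4]], [5, 6])
```

Each pass touches every accumulator exactly once, which is why the computation can be hidden behind the time needed to read the next column.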


In figure 2 one sees snapshots of the accumulating register
for the n processors in row one immediately following the
multiply-and-add operation.

Figure 2.


If we consider the complexity of this algorithm we first
observe that no algorithm can require less than $n^2$ time units
as at least $n^2$ elements must be read into the chip.
Moreover one can show that at least $n^2 + n$ units of time are
required before the final output is produced at the output
gate. (Consider the last element of A which is input at time
$n^2$ and must at least move to the output gate.)

Examination of the body of the for loop shows that
everything within the in-parallel statement is accomplished in
$n^2$ units of time. The remaining two steps in the for loop
take unit time each pass for a total of 2n. Outputting the
final values takes another n+1 steps plus the time to zero out
the registers takes one step, so the total time taken is $n^2$ +
3n + 1. Only an additional 2n + 1 units of time are required
over the theoretical minimum. Therefore we say that the
complexity of this algorithm is 2n + 1.


We now consider the second of the three problems. Given a
lower triangular matrix L with elements $a_{ij}$ for $i \ge j$ and a
vector $b = (b_1, \ldots, b_n)$, we are to determine the values of the
vector x such that Lx = b. Once again we assume the same
configuration of processors having the same capabilities.
The algorithm follows.


zero Out all registers 

forie ltondo 

//total time is n(n+1)/4 // 
read a5i_y into Ry (Po 4) 1 < kK < i. 
Re(Poy) € REPO) 

Shift-down 

read boi} into Rp (Py) 


in parallel do 


end 

R, « RE - R, + Ro except Pai where Rp t RL/R 
Ra(P) |) ¢ Rp (Py i) //save new x; II 

RAP) é Ri (Py py) //move-right-rowl // 
// total time is n(n+1)/4 // 
read 49i-k+ 1k into RE (Po 4) l<k $i. 
Re(Po 4) t RE (Pon) 

Shift-down 
read bo: into RAP yy) 


Cc 


in parallel do 


end 
R, e Ri, - R, t R. 
RAP i S RiP ay) //move-right-row1 // 
repeat 
Re ik) - Ro(P 1) , lsken. 


output x),-.%, from RE (Py) 


n 


Algorithm 2. Computing x: Lx = b. 
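
For reference, the result that Algorithm 2 produces is the ordinary forward substitution, sketched sequentially below. The sketch only shows what is computed; it does not model the register movement or the overlap with input on the chip, and the function name is illustrative.

```python
def forward_substitute(l, b):
    """Solve L x = b for a lower triangular L with nonzero diagonal."""
    n = len(b)
    x = [0.0] * n
    for i in range(n):
        # Subtract the contributions of the already-known x_0 .. x_{i-1}.
        s = b[i]
        for k in range(i):
            s -= l[i][k] * x[k]
        # Divide by the diagonal element to obtain the new x_i.
        x[i] = s / l[i][i]
    return x

solution = forward_substitute([[2.0, 0.0], [1.0, 4.0]], [4.0, 10.0])
```

Only the n(n+1)/2 elements on and below the diagonal are used, which matches the input count in the complexity discussion.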


Figure 3 shows snapshots of the algorithm as the
computation progresses. As for the complexity we observe
that the matrix L contains n(n+1)/2 elements which is the
total time for both read statements which input L. The time
to read in the b values is overlapped with the reading of L.
The remaining statements within the main for loop each take
n units of time for a total of 5n, zeroing out the registers
takes unit time as does the transfer from register R, of row 
one to register R- The actual output takes an additional n 
units of time and totalling these gives n(n+1)/2 + 5n + 1 + 1
+ n = n(n+1)/2 + 6n + 2. Using an argument similar to the
one used before, a lower bound is given by n(n+1)/2 + n + 1.
Therefore the complexity of this algorithm is no more than 5n
+ 1.
Figure 3. Computation of x given Lx = b, L lower triangular

We now consider the last of the three problems.
Given a matrix A we are to determine a factorization A = LU
where L is a lower triangular matrix and U is an upper
triangular matrix. One way to produce this factorization is to
use Gaussian elimination. If we assume that the matrix is
either symmetric positive definite or irreducibly diagonally
dominant (as first observed in [Kung79]) then pivoting is not
necessary. A sequential version of Gaussian elimination
without pivoting follows.


fork e 1 tondo 
for i« k+1 to n do 


Mk & 85 1K 
for j + k+l to ndo 


Algorithm 3. Sequential Gaussian elimination. 
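
A runnable sketch of sequential Gaussian elimination without pivoting is given below for comparison. It uses the common textbook arrangement, which is an assumption here: the multipliers are stored below the diagonal and form a unit lower triangular factor, while the reduced entries on and above the diagonal form the upper triangular factor. The labelling of the two factors may differ from the listing above.

```python
def lu_no_pivot(a):
    """Factor A = L U in place on a copy; assumes no zero pivot is encountered."""
    n = len(a)
    a = [row[:] for row in a]  # work on a copy of the input
    for k in range(n):
        for i in range(k + 1, n):
            m = a[i][k] / a[k][k]  # multiplier for row i at step k
            a[i][k] = m            # keep the multiplier in the eliminated position
            for j in range(k + 1, n):
                a[i][j] -= m * a[k][j]
    return a

packed = lu_no_pivot([[4.0, 3.0], [6.0, 3.0]])
# Unpack the two factors from the single packed array.
lower = [[1.0 if i == j else (packed[i][j] if i > j else 0.0) for j in range(2)] for i in range(2)]
upper = [[packed[i][j] if i <= j else 0.0 for j in range(2)] for i in range(2)]
```

The three nested loops each run over at most n values, which gives the cubic operation count of the sequential method.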
When executed on a sequential processor the
complexity of this algorithm is $O(n^3)$. The $m_{i,k}$ are the values
of the upper triangular matrix U, while the new values of $a_{i,j}$
are the values of the lower triangular matrix L. As there are
$n^2$ elements to be input we cannot expect to do any better
than to overlap all of the computation as the input values are
being read. The circuit for this computation is given below.
Observe that this circuit contains in its first two rows the
circuit we were using to compute b = Ax and to compute x
where Lx = b.




Figure 4: Circuit for determining LU = A, Ax and x: Lx = b. 


In figure 4 we assume that the processors are
numbered as P(0:n+1, 1:n+1) with the input processors in row
zero and row one, and the output processors in row n + 1
and column n + 1. Before we present the algorithm we need
to explain some abbreviations. The Shift Down operation is
generalized and is equivalent to $R_C(P_{i,j}) \leftarrow R_C(P_{i-1,j})$ for
$1 \le i, j \le n+1$. The algorithm follows.


zero out all registers ; RAP De ayy 
for i+ 1 to 2n-1 do 
ifisn then RA (Py) ¢ first element of row i 


RP) © Ry(P yp /Ra(Py 1) 
endif 


for | © 0 to FLOOR(i/2) in parallel do 
Ry(P i 4) t Rh (Pi) for jtl ¢k sn 


ifi>nthen RAP, ) goes to output fori-ns<ks 


repeat 
RolPon) © Re Poy) 
Shift down 
Multiply and Add //R. © R. - RéR, // 
skeen 
for j © 2 to CEIL(i/2) in parallel do 
een CL 
repeat 


if odeki) then RePeen ciy2yn) © RalPceis2yy? 
repeal 


al), 


Algorithm 4, Computing £,U; A = LU. 


Examining the complexity of this problem we observe
that $n^2$ elements must be input, so that is an obvious lower
bound. In the algorithm each row is input separately for a
total of $n^2$ units. At the same time as the input is being read
in, one register in each processor on the main diagonal of the
grid is having its value propagated across its row.
This is done simultaneously with the inputting or the
outputting. Thus the total time for the first inner for loop is
$n^2$. The operations described by the second inner for
statement are also done in parallel, for a total cost of 4n-2.
All other statements within the main for loop take n units of
total time. Summing up we get 1 + 1 + 2n + $n^2$ + 3n + n +
4n - 2 + n = $n^2$ + 11n. Below is a snapshot of the algorithm
as it processes a 4 x 4 matrix.


Conclusions 


For all of the three problems we have seen, a single
chip design was used to compute the solutions. Each of the
processors on the chip requires a small set of simple
capabilities and they are locally connected. The computing
time analyses have intentionally included I/O as well as
processing time as we feel this is an especially critical
parameter for VLSI. The algorithms given here show that
methods can be devised which overlap almost all of the
computation with the input and output.

Figure 4. Computation of L, U such that A = LU.
Abstract Parallel processors with bit-
serial PE's usually implement arithmetic func-
tions by a sequence of word parallel arithmetic
operations; however, basic operations must be
specified at the bit-serial level. In this paper
the possibility of more efficiently implementing
a function with a special, tailored sequence of
bit-serial operations is considered. A general
scheme is described for generating efficient pro-
grams to implement arbitrary functions on bit-
serial-arithmetic processors. This scheme is
based on logic design methodology and involves
designing a logic network to realize a desired
function. The parallel processor is then used to
efficiently simulate a set of these networks.
Heuristic design algorithms are used to generate
the logic networks; several algorithms are
described and compared with some benchmark func-
tions.

This scheme is suitable for implementing any
arbitrary function, however the computation cost
increases very rapidly with the number of inputs
to the function. Examples are given of some
8-input function implementations.

Programs may be generated for most bit-serial
parallel processors. Several efficient PE
designs are described and analyzed.

Introduction 


Bit-serial parallel processors usually specify
arithmetic functions at the word parallel level.
However, basic operations must be specified at
the bit serial level which opens the possibility
of developing special, tailored, instruction se-
quences for implementing specific functions. On
conventional computers, operations are at the
word level and very efficient multiply and divide
operations are usually available; moreover, for
fast function implementation a table look-up
method can be used. Fast multiplication and
division is difficult to implement on bit-serial
parallel processors and an indexing mechanism,
necessary for a table look-up scheme, is usually
not available.

A function may be considered as a mapping
rather than an arithmetic expression. For exam-
ple, in an image processing application a loga-
rithmic transform of an image may be required;
typically the image may consist of 8-bit integer
brightness values and the desired result may also
be a scaled 8-bit integer. Therefore we wish to
compute the function $\lfloor 255 \cdot \log(X+1)/\log(256) \rfloor$
where X is the value of each element of the im-
age. The conventional approach to this problem
is to use an arithmetic approximation technique
to compute the function which must be implemented
with sufficient precision to avoid round off er-
rors. An alternative, more direct, approach is
to consider the function as a specified mapping
from 8 inputs to 8 outputs and achieve this map-
ping in one stage with a sequence of single bit
operations. This direct approach is useful for
existing bit-serial parallel processors such as
STARAN [1] and also for emerging parallel proces-
sors based on LSI technology such as CLIP4 [2],
MPP [3] and DAP [4].
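
The view of the logarithmic transform as a mapping from 8 inputs to 8 outputs can be made concrete by tabulating it. The sketch below is illustrative: it evaluates the logarithms in base 2 (so that log(256) is exactly 8), and the helper names are not taken from the paper.

```python
import math

def log_transform(x):
    """floor(255 * log(x + 1) / log(256)) for an 8-bit brightness value x."""
    return math.floor(255 * math.log2(x + 1) / 8)

# One table row per input pattern: 256 rows of 8 output bits each.
truth_table = [log_transform(x) for x in range(256)]

def output_bit(x, bit):
    """Value of one output column of the truth table for input pattern x."""
    return (truth_table[x] >> bit) & 1

# Each output bit is a separate 8-input Boolean function to be realized as a logic network.
columns = [[output_bit(x, bit) for x in range(256)] for bit in range(8)]
```

Each of the eight columns is one Boolean function of the eight input bits; the logic design step then looks for a small network of two-input gates that realizes all eight together.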

A general scheme for implementing arbitrary 
functions on a bit-serial SIMD processor is 
described. To utilize this scheme the PE's of 
the processor must be capable of computing logi- 
cal functions of 2 operands; however, only a sub- 
set of the 16 possible Boolean functions is re- 
quired. 

The general procedure for implementing a func- 
tion is as follows: First a truth table of all 
possible outputs for all possible inputs is 
created. A conventional computer could use a 
table-look-up algorithm at this stage; however 
bit-serial SIMD processors usually cannot effi- 
ciently implement such algorithms. In a SIMD 
processor each PE would have to store the com- 
plete truth table; moreover each PE would also 
need a special, individual indexing mechanism to
access the table data. 

Second, an efficient program is developed from
the truth table which will implement the desired
function. Several algorithms for automatically
generating efficient programs from the truth
table are described in this paper; these algo-
rithms are based on logic design methodology. A
logic network is designed which will realize the
desired truth table, then this design is
translated into a program sequence for the paral-
lel processor. Typically, one processing in-
struction on a bit-serial PE may be considered to
simulate a two-input logic gate.

A suitable design algorithm should generate 
networks with the following characteristics: 


(a) The total gate count in the network should
be a minimum since each gate implies at
least one process instruction in the final
program sequence.

(b) The network can consist only of two input
gates; however, many different Boolean func-
tions may be available at equal cost.

(c) The number of logic levels in the network is
unimportant. In fact, a cascade of N 2-in-
put gates can usually be realized more easi-
ly than a balanced binary tree of N gates.

(d) Usually multi-output logic networks will be
required.


Currently there is no simple, efficient design 
algorithm which will generate a minimum logic 
network with the above characteristics. An ex- 
haustive selection algorithm would require too 
much computation for practical sized problems. 
Several different heuristic algorithms have been
developed for designing efficient multi-output
logic networks with 2-input gates.

An alternative design strategy is to create
networks with universal logic modules (ULM's). A
ULM is a 3-input device having two data inputs
and one selection input. In some processor
designs a one-bit mask register is associated
with each PE which, when set, inhibits a PE from
processing data. These processors can efficiently
simulate a network of ULM's by using the mask
register for the selection input.

The main advantage of the described scheme is
its generality, in that a program for any
arbitrary function may be generated, including a
random mapping. The two main problems with the
technique are: (a) the computation time for the
logic design algorithms is proportional to two to
the power of the number of inputs, and (b) the
logic design algorithms do not generate minimum
designs, and in some cases a more efficient
program can be developed with analytic methods, e.g.
a program for bit-serial addition. Several
illustrative functions which have been programmed
with the logic design scheme are discussed.


Logic Design Algorithms 


Two schemes for designing networks with 2-in- 
put gates have been investigated. In the first 
scheme a two-level minimal sum of products design 
is decomposed into a network of 2-input AND gates 
followed by 2-input OR gates. In the second
method the best Reed-Muller unate expression of
the logic function is decomposed into 2-input AND
gates followed by 2-input Exclusive-OR gates.
Both of these schemes have been extended to
multi-output designs.

An alternative approach is the method of
disjunctive decomposition [5] [6]. However, those
techniques work best for disjunctive functions,
which do not frequently occur in practice.
Moreover, it is more difficult to make such
schemes share logic in multi-output designs.

Several minimization algorithms have been
developed for networks of Universal Logic Modules
(ULM's). A recent paper by Voith [7] describes a
minimization scheme for single control input
ULM's. Some single output functions have been
realized with ULM's; this design method, like
disjunctive decomposition, is difficult to extend
to efficient multi-output designs.


AND-OR Network Designs 

For the AND-OR network design, the prime
implicants for the required function are generated
by a recursive consensus tree algorithm [8]. The
following simple algorithm is then used to select
an irredundant cover. It uses three lists: a
list of selected implicants, SI; a list of
remaining implicants, RI; and a list of terms to
be covered, TC.

Prime implicant selection algorithm: 

1. Set SI to all essential prime implicants;
set RI to all non-essential prime implicants; and
set TC to all minterms in the desired function
which are not covered by essential prime
implicants.

2. Select the prime implicant from RI which
covers the greatest number of minterms in TC.

3. Add the selected prime implicant to SI and 
delete from TC all covered minterms. 

4. If TC is empty, the selection is complete 
and SI contains the required irredundant cover. 
If TC is not empty return to step 2. 
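
The selection steps above can be sketched in Python as follows. This is only an illustrative sketch, not the authors' program: the function name `select_cover`, the dictionary representation of prime implicants (a label mapped to the set of minterms it covers), and the tie-breaking rule (first candidate wins) are assumptions introduced here.

```python
def select_cover(prime_implicants, minterms):
    """Greedy cover selection following steps 1-4.

    prime_implicants: dict mapping a label (hypothetical, e.g. "0-1")
                      to the set of minterms that implicant covers.
    minterms:         iterable of minterms of the desired function.
    """
    minterms = list(minterms)

    # Step 1: an implicant is essential if it is the only one
    # covering some minterm.
    essential = []
    for m in minterms:
        covering = [p for p, cov in prime_implicants.items() if m in cov]
        if len(covering) == 1 and covering[0] not in essential:
            essential.append(covering[0])

    si = list(essential)                                      # selected
    ri = [p for p in prime_implicants if p not in essential]  # remaining
    tc = set(minterms)                                        # to cover
    for p in si:
        tc -= prime_implicants[p]

    # Steps 2-4: repeatedly take the implicant covering most of TC.
    while tc:
        best = max(ri, key=lambda p: len(prime_implicants[p] & tc))
        if not prime_implicants[best] & tc:
            raise ValueError("implicants do not cover the function")
        si.append(best)
        ri.remove(best)
        tc -= prime_implicants[best]
    return si


# A cyclic 3-input function with no essential prime implicants.
pis = {
    "00-": {0, 1}, "0-0": {0, 2}, "-01": {1, 5},
    "-10": {2, 6}, "1-1": {5, 7}, "11-": {6, 7},
}
cover = select_cover(pis, [0, 1, 2, 5, 6, 7])
```

Being greedy, the procedure yields a cover but not necessarily a minimum one, which is consistent with the heuristic character of the design algorithms described in the text.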

The following general gate selection algo- 
rithm, which can decompose a set of terms into 
two input gates, is then used to decompose all 
the product terms into 2-input AND gates. Some 
of the 2-input AND gates which are connected to 
inputs of the function may have inverted inputs. 


Gate Selection Algorithm:

1. Every possible 2-input gate between function
   inputs (including any intermediate gate outputs)
   which could contribute to any of the
   prime implicants is examined. The number of
   implicants to which each of these gates
   contributes is recorded.

2. The gate which contributes to the most prime
   implicants is selected. The representations
   of the prime implicants which are affected by
   the selected gate are modified to indicate
   its presence.

3. If all prime implicants are completely
   implemented then the selection is complete;
   otherwise go back to step 1.
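
A minimal Python sketch of this gate selection loop, applied to the decomposition of product terms into 2-input AND gates, is given below. The representation is an assumption made for illustration: each product term is a set of signal names (a leading "~" marks an inverted input), intermediate gate outputs are named "g0", "g1", ... (hypothetical names), and ties are broken in favour of the first pair encountered.

```python
from collections import Counter
from itertools import combinations


def decompose_into_2input_gates(terms):
    """Decompose product terms into shared 2-input AND gates.

    terms: list of iterables of signal names, one per product term.
    Returns (gates, outputs): gates is a list of (name, in_a, in_b),
    outputs[i] is the signal that realizes terms[i].
    """
    terms = [set(t) for t in terms]
    gates = []
    while any(len(t) > 1 for t in terms):
        # Step 1: count, for every candidate gate (pair of signals),
        # the number of terms it could contribute to.
        counts = Counter()
        for t in terms:
            for pair in combinations(sorted(t), 2):
                counts[pair] += 1
        # Step 2: select the gate contributing to the most terms and
        # record its presence in the affected terms.
        (a, b), _ = max(counts.items(), key=lambda kv: kv[1])
        name = f"g{len(gates)}"
        gates.append((name, a, b))
        for t in terms:
            if a in t and b in t:
                t -= {a, b}
                t.add(name)
        # Step 3: loop until every term is a single signal.
    return gates, [next(iter(t)) for t in terms]


# Terms abc, abd and a(~c): the gate a.b is shared by two terms.
gates, outputs = decompose_into_2input_gates(
    [{"a", "b", "c"}, {"a", "b", "d"}, {"a", "~c"}]
)
```

In this example four gates are produced instead of the five needed when each term is decomposed separately, because the gate combining `a` and `b` is shared.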


For a single output network, the product terms 
may be connected by a cascade of OR gates. For 
multiple-output networks, the product terms for
all outputs are first generated, then a network 
of OR gates is generated using the gate selection 
algorithm described above. In this case the in- 
puts to the OR gates are the outputs of the pro- 
duct terms and the outputs of the OR gates are 
the function outputs. 


In this section networks consisting of a set
of product terms followed by a set of
exclusive-OR gates are considered. It has been conjectured
that networks of this form may be constructed
more economically than the conventional AND-OR
form; algorithms have been developed for
minimizing a subset of the AND-exclusive-OR networks
called the generalized Reed-Muller canonic forms
[9-11]. An important feature of these canonic
forms is that they are unate, i.e. an input may
occur either in its true or complemented form but
not both for the same function.

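A common expression for a Reed-Muller expansion
is:

$$F(x_1, \ldots, x_n) = a_0 \oplus a_1 \dot{x}_1 \oplus a_2 \dot{x}_2 \oplus \cdots \oplus a_n \dot{x}_n \oplus a_{n+1} \dot{x}_1 \dot{x}_2 \oplus \cdots \oplus a_{2^n-1} \dot{x}_1 \dot{x}_2 \cdots \dot{x}_n$$

where the $a_i$ are 0,1 coefficients and $\dot{x}_i$ is
consistently either $x_i$ or $\bar{x}_i$.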


There is a maximum of 2^n terms in
each expression. Also, for a completely defined
function there are 2^n different canonic forms,
one for each possible arrangement of true and
complemented inputs. Conventionally, a minimum
unate network is determined by selecting the
canonic form containing the minimum number of
terms; this results in a minimal unate design.
For a two-input gate network the product terms
are decomposed with the gate selection algorithm and
a cascade of exclusive-OR gates is used to
combine the product terms. Due to the unate
property of the product terms a large amount of sharing
of two-input gates between the product terms is
usually possible.
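
The choice of canonic form can be illustrated with a short Python sketch. It is not taken from the paper: the function names, the encoding of a truth table as a list indexed by the input bits, and the use of a bit mask `polarity` (bit i set means input i is used in complemented form) are assumptions. The coefficients are obtained with the standard exclusive-OR butterfly transform, and all 2^n polarities are tried exhaustively.

```python
def rm_coefficients(truth, n, polarity):
    """Coefficients of the Reed-Muller form with a fixed polarity.

    truth:    list of 2**n output bits; truth[x] is F for input word x.
    polarity: bit mask, bit i set => input i appears complemented.
    Returns a list a where a[m] is the coefficient of the product of
    the literals whose bit is set in m (a[0] is the constant term).
    """
    # Re-index the table by literal values instead of input values.
    a = [truth[y ^ polarity] for y in range(1 << n)]
    # In-place exclusive-OR butterfly, one pass per variable.
    for i in range(n):
        bit = 1 << i
        for y in range(1 << n):
            if y & bit:
                a[y] ^= a[y ^ bit]
    return a


def best_polarity(truth, n):
    """Return (polarity, coefficients) with the fewest product terms."""
    candidates = ((p, rm_coefficients(truth, n, p)) for p in range(1 << n))
    return min(candidates, key=lambda pc: sum(pc[1]))


# 2-input OR: three terms with true inputs, two with both complemented.
or_polarity, or_coeffs = best_polarity([0, 1, 1, 1], 2)
```

For the 2-input OR function the sketch selects the form with both inputs complemented, i.e. the constant 1 exclusive-ORed with the product of the two complemented inputs. As in the text, only the number of terms is minimized here; the cost of the individual product terms is ignored.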

For multi-output designs two algorithms have
been investigated. In the first, called
multi-polarity, the terms for the best canonic
form for each output are selected and the gate
selection algorithm is used to implement them.
Then the gate selection algorithm is used to
generate a network of exclusive-OR gates in a
similar way to the AND-OR multi-output design.
As a different canonic form may be selected for
each of the outputs, the multi-output network is
no longer unate. The second algorithm determines
which canonic form requires the least number of
terms to generate all the outputs. The
multi-output network is then generated in a similar way
to the multi-polarity algorithm except that the
same canonic form is used for all outputs; this
results in a unate multi-output design. The
unate scheme has the advantage that the product
terms are selected from the same set for each
output, which increases the possibility of two or
more outputs sharing similar terms. The problem
with the unate algorithm is that some outputs may
have a very simple realization with a different
canonic form.

One fact which is not used in either algorithm 
is that, unlike the conventional product of sums 
canonical form, the number of variables in each
product term is not a constant. Therefore in 
selecting the canonic form to be implemented, the 
cost of its constituent product terms should be 
considered. 


An algorithm for designing networks of
arbitrary functions with ULM's has been described by
Voith [7]. We have developed the following
simple recursive algorithm which is currently being
evaluated.


ULM Minimization Algorithm:

1. Find the input which is most correlated with
   the output, i.e. select the input which has
   the most matches with the output in either
   true or complemented form.

2. Use this input $x_i$ to control a ULM which
   partitions the function into two sub-functions

   $$f_1(X) = x_i F(x_1, \ldots, x_n), \qquad f_0(X) = \bar{x}_i F(x_1, \ldots, x_n)$$

3. Apply this algorithm recursively to $f_1(X)$
   until all inputs are constant 1 or 0.

4. Apply this algorithm recursively to $f_0(X)$
   until all inputs are constant 1 or 0.

For the worst case the above algorithm would
generate a binary tree of n levels involving 2^n - 1
ULM's. However, the lowest level can always be
removed by applying the last variable values to
the inputs of the next level; therefore a maximum
of 2^(n-1) - 1 ULM's could be required. Moreover, by
choosing a suitable selection strategy at step 1
of the minimization algorithm the total number of
ULM's required is usually significantly less than
the worst case. So far this algorithm has only
been used to generate single output networks;
there is no simple way to extend it to efficient
multi-output designs.
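
The recursive ULM procedure can be sketched in Python as below. This is an illustration under stated assumptions rather than the authors' implementation: the ULM is modelled as a two-way selector (the selection input chooses one of the two data inputs), truth tables are lists indexed by the input bits with `names[0]` as the least significant bit, and the names `cofactor`, `build_ulm_tree` and `count_ulms` are hypothetical. The lowest level is removed, as described in the text, by feeding the last variable (or its complement) directly into a data input.

```python
def cofactor(truth, k, i, value):
    """Sub-function of a k-input table with input i fixed to value."""
    return [truth[idx] for idx in range(1 << k) if (idx >> i) & 1 == value]


def build_ulm_tree(truth, names):
    """Return 0/1, a literal such as "c" or "~c", or a tuple
    ("ULM", select, data_if_0, data_if_1)."""
    k = len(names)
    if all(v == truth[0] for v in truth):
        return truth[0]                      # constant sub-function
    if k == 1:                               # last variable: no ULM needed
        return names[0] if truth == [0, 1] else "~" + names[0]

    # Step 1: input with most matches to the output, true or complemented.
    def matches(i):
        same = sum(1 for idx, v in enumerate(truth) if (idx >> i) & 1 == v)
        return max(same, len(truth) - same)

    i = max(range(k), key=matches)
    rest = names[:i] + names[i + 1:]
    # Steps 2-4: partition on that input and recurse on both halves.
    low = build_ulm_tree(cofactor(truth, k, i, 0), rest)
    high = build_ulm_tree(cofactor(truth, k, i, 1), rest)
    return ("ULM", names[i], low, high)


def count_ulms(node):
    if isinstance(node, tuple):
        return 1 + count_ulms(node[2]) + count_ulms(node[3])
    return 0


and_tree = build_ulm_tree([0, 0, 0, 1], ["a", "b"])
parity_tree = build_ulm_tree([0, 1, 1, 0, 1, 0, 0, 1], ["a", "b", "c"])
```

The 3-input parity function is a worst case for this procedure and needs 2^(3-1) - 1 = 3 ULM's, while the 2-input AND needs a single ULM.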


PE Architectures 


In this section several possible PE
architectures for efficiently simulating logic networks
are considered. The basic functional units of a
PE are shown in Fig. 1. Operands (gate inputs)
are obtained from a 1-bit-wide memory M and each
two-input gate is simulated by a Boolean
processor F, which can realize any of the 16 possible
functions of its inputs. Two single-bit
registers, A and B, are used to temporarily hold the
inputs for F. A single isolated gate requires
three memory cycles: two to load A and B, and one
to store the result.

[Fig. 1. PE1 basic architecture]

[Fig. 2. PE2 BASE architecture]

The implementation of a single output network
of N gates with a fan-out limit of 1 will be
discussed first. The more general case for a
multi-output function involving gates with a
fan-out greater than 1 will then be considered. The
architecture PE1 can implement any network in 3N
memory cycles. One way to improve the PE
performance is to use the BASE architecture [12] shown
in Fig. 2. In this case a three-input, general
Boolean processor, Fe, is used. A pair of
connected 2-input gates can be simulated in an
operation involving four memory cycles.
Therefore, any network can be implemented in 2N memory
cycles.


[Fig. 3. PE3 accumulator architecture]

In the architecture PE3 shown in Fig. 3, one
of the holding registers, B, is used as an
accumulator. The B register is not directly
connected to M; however, it may obtain data from M
through A and F without involving any more memory
or clock cycles; when B is loaded from A, A is
loaded from M. This scheme only involves the
same number of control inputs (7 + memory
address) that PE1 requires. In fact one control
input may be discarded, as the enable inputs for
the A and B registers always have the same value.
The efficiency of this PE depends upon the
topology of the network to be implemented. Two
extreme cases exist: a cascade network, which is the
best case, and a balanced tree network, which is
the worst case. Note that this is the reverse of the
conventional logic design scheme, where it is
important to minimize the number of logic levels.
The PE3 architecture requires N+2 memory cycles
to implement a cascade network and a worst case
of 2N+1 memory cycles to implement a balanced
tree network.

Various alternative architectures have been
developed for more efficiently implementing
balanced tree networks. A simple modification to
PE3 is shown in Fig. 4; the register B is now the
top element of a stack S. An extra control
signal is required so that A may be loaded with the
output from F. This architecture requires
1.5(N+1) memory cycles to implement a balanced
binary-tree network. An optimal architecture is
shown in Fig. 5; it involves a FIFO buffer and
additional selection logic. The minimum number
of memory cycles to implement a balanced tree,
with a single memory M and a two-input Boolean
processor F, is N+1 + log2(N+1).

[Fig. 4]

[Fig. 5. PE5 optimal balanced tree architecture]
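
The cycle counts quoted in the text for a single-output network of N gates can be compared with a few lines of Python. The function name `memory_cycles` and the dictionary layout are assumptions made for this sketch; the labels PE4 and PE5 for the stack and FIFO designs follow the figure numbering, and only formulas stated in the text are used.

```python
import math


def memory_cycles(n):
    """Memory cycles for a single-output network of n two-input gates,
    using the formulas quoted in the text (fan-out limit of 1)."""
    return {
        "PE1 (any network)": 3 * n,
        "PE2 BASE (any network)": 2 * n,
        "PE3 cascade": n + 2,
        "PE3 balanced tree": 2 * n + 1,
        "PE4 stack, balanced tree": 1.5 * (n + 1),
        "PE5 FIFO, balanced tree": n + 1 + math.log2(n + 1),
    }


# A balanced tree of 7 gates (8 inputs).
cycles = memory_cycles(7)
```

For N = 7 this gives 21 cycles on PE1, 14 on PE2, 15 on PE3 for a balanced tree (9 for a cascade), 12 with the stack and 11 with the FIFO buffer.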


The main characteristics of the various
architectures shown in Figs. 1-5 are summarized in
Table 1. The worst case for a multi-output
network with K outputs occurs when there are K
disjoint networks with N/K gates each. If a gate
has a fan-out greater than one then an extra
memory cycle is required to store the output of
this gate into M. Therefore, if R gates in a
network have a high fan-out then R additional
memory cycles will be required. The worst case
number of memory cycles to implement a network
having K outputs and R high-fan-out gates is
given in Table 1.

Designs with ULM modules usually result in 
tree networks. In Table 2 the number of memory 
cycles for implementing a balanced tree of N 
ULM's is given. The performance of the PE's is
considerably improved if an extra register is
used to store the selection input (plus a small
amount of additional logic). The performance of
the PE's with this modification is also given in
Table 2.


Results 


Several benchmark functions have been used to
evaluate the various design schemes; these are 
described in Table 3. The gate counts for imple- 
menting these functions with different algorithms 
are given in Table 4. 


Table 1.
Performance of PE Architectures

[Table body not recoverable. Columns: Architecture;
No. of memory cycles (Cascade, Balanced Tree, Worst
Case); Control inputs (excluding M address).
Legible entries: N+K+R+K log2(N/K+1);
(N+1)+log2(N+1).]


                     No. of Memory Cycles
Architecture   Original design      With extra register

               8N                   4N
               4N                   -
               6N                   3N+1
               5.5(N+1)             2.5(N+1)
               3N+1+log2(N+1)       2N+1+log2(2N+1)

Table 2.
Efficiency of PE Architectures
for a Balanced Tree of N ULM's


Table 3
Test Functions

[Table body not recoverable.]


Table 4
Gate Count for Different Algorithms

[Table body not recoverable. Columns: Function;
AND-OR (single output, multi-output);
AND-Exclusive-OR (single output, multi-polarity,
unate multi-output); ULM (single output):
ULM's + 2-input gates.]


For the AND-OR networks figures are given for
a set of single output networks (one for each
output) and for a multi-output design. For the
AND-Exclusive-OR networks figures are given for
both the multi-polarity and unate multi-output
algorithms. The multi-output designs are
consistently better than the single output designs,
and for all 8-input functions the unate
AND-Exclusive-OR algorithm gives significantly the
best results.


The number of high fan-out gates affects the cost
of implementing a network. However, the number
of high fan-out gates for the AND-Exclusive-OR
unate implementations of the log function is only
32, or 7.6% of the total gates required.


In the last column of Table 4 figures for
single output ULM designs are given. It was found
that more than half the ULM's in each design
could be reduced to 2-input gates; therefore
figures for each ULM design are given as a number of
ULM's plus a number of 2-input gates. The gate
counts for ULM designs compare well with the
other algorithms; however, the ULM is a 3-input
device and requires at least one more memory cycle
to implement than a 2-input gate.


Conclusion 


A scheme for implementing arbitrary functions
on bit-serial parallel processors has been
presented. Several relevant heuristic logic
design algorithms and special PE architectures
have been described and evaluated. The PE
architectures of most current bit-serial parallel
processors are suitable for implementing this scheme,
and a small improvement in performance could be
achieved by adding a stack mechanism to the PE's.

Results from current design algorithms
indicate that an arbitrary 8-input, 8-output function
could be implemented in less than 1000 memory
cycles on many bit-serial PE architectures. This
is an efficient way to implement functions with a
complex form; however, if the function has a
simple arithmetic solution, e.g. addition or two's
complement, then conventional arithmetic
techniques may be more efficient. Further work is
needed to improve the heuristic logic design
algorithms.


Summary


Numerous scientific applications of computers require
reliable results. Several approaches offer assistance to
the user in this field [1]. Range arithmetic serves as a
valuable tool for the definition of upper and lower limits
for the results from numerical computations [2]. Several
software implementations of range arithmetic as ALGOL or
FORTRAN routines have appeared [3, 4]. All
of the programs that give complete algorithms suffer from
some or all of the following drawbacks:

a. Extremely poor execution performance;

b. Approximations to the true bound values;

c. Requirements in software for direct machine
   accesses; etc.

Implementation in a programming language of higher 
level is required for the routines in order to achieve 
transportability between machines, which in turn pro- 
hibits access to the machine facilities for proper hand- 
ling of the data. Additionally, computers of existing 
design are poorly suited to range arithmetic. 


A bit-slice multi-microprocessor system has been
designed as a better, more appropriate, and efficient
solution for the implementation of range arithmetic. The
special features of this design appear in a parallel
computer structure with, in the minimal configuration,
two semi-independent bit-slice microcomputers with
common input and intermediate registers under the
control of an execution processor. The microcomputers
have a common microprogrammed control memory for
the execution of the algorithms of the range arithmetic
as far as possible and separate control memories for the
decision logic and for the specific operations (see Fig. 1).


The control processor executes during operation all 
statements for control and fixed-point arithmetic, but 
not the floating-point arithmetic [5] and the decision 
sequence. These portions are executed by the two 
bit-slice microcomputers in parallel. One computes 
the result of the required computation for the upper 
bound, the other for the lower bound. The number and
kind of operations for the execution of an arithmetic
instruction are the same, but they work on different
operands. A common control store can therefore be used for
both. Thus, the system's architecture follows closely the
required tasks. Execution time for an instruction is close
to the time the control processor needs for fixed-point
data [5]. This configuration permits compliance with
such an optimal mode of execution.


The data words for the floating-point arithmetic conform
at present to the single precision data formats of the
360/70 computer family for reasons of comparative
investigations.

The system is being implemented and tested currently
with the Intel 3000 series microprocessors [6] for the
microcomputers for range arithmetic and an 8080 as
control processor. A detailed report of this investigation
will appear later [7]. Estimates for the performance of
the present configuration show a significant decrease of
computer time compared to execution of range arithmetic
on serial computers of medium size.


Summary


Parallel algorithms FCP and FBP are
proposed for solving the first biharmonic boundary
value problem on a unit square $\Omega$ with the
boundary $\partial\Omega$. A function $u(x,y)$ is to be found
which satisfies

$$\Delta\Delta u(x,y) = f(x,y) \quad \text{in } \Omega,$$

$$u(x,y) = g(x,y), \quad u_n(x,y) = r(x,y) \quad \text{on } \partial\Omega.$$

The functions f, g, r are given and $u_n$ denotes
the outward normal derivative of u on $\partial\Omega$.


Superimposing a square grid with mesh size
$h = (N+1)^{-1}$ ($N = 2^l - 1$ for some positive
integer $l$), a semidirect method [1] is applied for
solving this problem. For this purpose, the
biharmonic equation is treated as a coupled pair of
difference equations. Using this method, two
block-tridiagonal systems of linear algebraic equations
of order $N^2$ with smoothing need to be solved in
one iteration [1] - [3]. This process terminates
when the difference of two successive iteration
values in absolute value is less than a given $\varepsilon > 0$
in all interior grid points.


We shall assume a parallel computer of SIMD
type consisting of $N^2$ identical processors. Any
arithmetic operation performed in one time step
on several processors is assumed to be one
operation step. The usual computation of one
iteration [1], [2] can be replaced by the computation
of one three-term recurrence formula, as
suggested in [3]. Using this formula, parallel
algorithms FVP and FWP described in [4] require
respectively $14 \log_2 N$ and $18 \log_2 N$ parallel steps
per iteration.


In both the algorithms presented, a vector
of values for one iteration is also computed by the
formula mentioned. In the FCP algorithm, the
cyclic odd-even reduction algorithm [5] is applied
for evaluating this formula. One iteration can be
obtained on $N^2$ processors in only $12 \log_2 N$ steps.
The stable variant of the cyclic odd-even
reduction algorithm was developed by Buneman [5].
This method was applied effectively in
constructing the FBP algorithm, which requires
$20 \log_2 N$ parallel steps per iteration on $N^2$
processors. For comparison, the algorithm
presented in [2] requires $24 \log_2 N$ steps per iteration
on $N^2$ processors. As shown in [3], the number
of iterations required for the algorithms
mentioned is o(nt/tog,N), when ε = O(n?) and the
optimal smoothing parameters [1] are used.


The advantage of both the algorithms FCP 
and FBP lies in replacing N multiplications of 
real vectors with full matrices of order N by 
the multiplication with the diagonal ones only. 
We note that when the number of processors re- 
quired is reduced by half, both the algorithms 
can be used in an efficient manner, too.


Summary

The importance of Integer Programming [1]
in solving problems of practical importance has
led to a great amount of effort in recent years
on methods of their solution. In many of these
problems, the integer variables are further
restricted to take one of the values 0 or 1.
The further importance of solving the 0-1 problem
lies in the fact that the integer problem can be
converted to an equivalent problem having only
bivalent variables. The solution space of the
0-1 problem involving n variables is finite and
consists of 2^n possible points or nodes. A
straightforward method of solving the problem
is via an exhaustive or explicit examination of
each of these possible points in the solution
space. This approach may be suitable when the
value of n is small, but since the size of the
solution space increases rapidly with n, the
time required to explicitly enumerate the
solution space becomes prohibitive. Thus, the
technique of implicit enumeration is used.

A simple, generally applicable implicit
enumeration algorithm for the 0-1 problem was
presented in [2,3]. In this paper, we present
a microprocessor implementation of the algorithm,
as a system having a number of processing stages,
with each stage having a number of
microprocessors (referred to hereinafter as processors).
Since the computation involved in the algorithm
is divided into a number of stages, it is
possible to assign a special processor to each
stage to perform its respective task.
Furthermore, each stage could have a number of
processors working independently on different
sets of data; the actual number of processors
in each stage could be chosen to match the
processing load of the stage and thus optimize
their utilization. Each stage would have a
controller to orchestrate the processors
contained therein and control the communication between
the stages using appropriate buffers. The
overall supervision of the solution process can
be performed by a supervisor.

The system, called Bivalent Processing Unit
(Figure 1), is an example of a Staged Multiple
Instruction Stream, Multiple Data Stream
(SMIMD) system [3]. This system could be
interfaced to a general purpose digital computer
which would communicate the user problem to the
processing unit, and the BPU would generate the
solutions, if any, and transmit them back to
the host computer for communication to the user.
Simulation results of the operation of the BPU
on a set of test problems from the relevant
literature confirm the feasibility of the
approach and show an improvement in the range
of 2 to 3 orders of magnitude over the
published results.

[Figure 1. The overall BPU structure]

References

[1] Dantzig, G.B., "On the Significance of
    Solving Linear Programming Problems with
    Some Integer Variables", Econometrica,
    Vol. 29 (1960), pp. 30-44.

[2] Desai, B.C., "An Implicit Enumeration
    Algorithm to Solve the 0-1 Problem",
    to be published.

[3] Desai, B.C., "The BPU, A Staged Parallel
    Processing System to Solve the Zero-One
    Problem", Proc. of ICS '78, Taipei, Taiwan,
    December 1978, pp. 802-817.


Abstract: Known solutions of a large number of
important (and difficult) computational problems, 
called NP-complete problems, depend on 
enumeration techniques, which examine all 
feasible alternatives. This paper considers the 
design of enumeration schemes in a distributed 
environment in an attempt to exploit the parallel 
activities inherent in enumeration algorithms. 

An overview for the general enumeration 
techniques used to solve NP-complete problems, 
namely Integer Programming, Dynamic Programming, 
and Branch and Bound is presented together with a 
discussion of the suitability of each as a basis
for distributed enumeration algorithms. It is 
shown that a variation of Branch and Bound is the 
most suitable enumeration method for distributed 
processing. An approach that takes into account 
communication and computation time for analyzing 
distributed algorithms is then discussed. The 
approach is then used to analyze the performance 
of the proposed BB algorithm. 


Introduction: 

During the last decade, a special class of
computational problems, called NP-complete
problems [Cook, 1970], that has defied all
attempts at analytical (polynomial) solutions has
been identified. This class includes many famous
and important problems such as the general
scheduling problem, the travelling salesman
problem, and the graph partitioning problem. The
only known technique to obtain a solution for any
of these problems is to formulate them as an
enumeration of all the possible permutations over
some set of objects.


Also, in the last few years a new class of
problems has emerged in connection with the
design of distributed software. Motivated by the
evolution of microcomputers, a new type of
machine, called network computers ([Bell and
Newell, 1971] and [Huen et al., 1977]), has been
recently designed. They are composed of a number
of computers (until now mostly microcomputers)
connected together in the form of a network and
cooperating to perform computational jobs. One
of the most important areas of research in
connection with these machines is designing
software for them. This software must be
designed in a distributed form as a number of
cooperating tasks that can run on separate
computers of the network. A recent study
[El-Dessouki, 1978] shows that many software
partitioning problems on network computers are
NP-complete problems. Also, enumeration
techniques to solve these problems consist of
activities that can be carried out in parallel.
There is consequently a strong need for designing
a parallel enumeration technique for these
problems in general.


Partitioning programs automatically to produce
clusters distributed to run on a network
computer is an example of distributed software
design. This problem is important in that its
solution makes network computers self-sufficient,
bringing into existence mechanisms that work on
network computers to produce software for network
computers.


In the distributed environment of network
computers, software should be designed with the
following characteristics:

1. Software must be in the form of clusters of
small processes cooperating together via messages
to perform the computational job.

2. Each cluster must be small enough to fit in
the memory of one computer.

3. Processes in the same cluster may communicate
using common data structures. But processes in
different clusters can communicate only by
sending messages.

4. The communication overhead incurred in
message sending is significant. Consequently the
amount of communication between different
clusters should be minimized.

5. In order to utilize the parallel processing
capability that the network offers, processing in
different computers should be overlapped and
equally distributed.


In a network computer environment, information
sharing requires message overhead and causes a
slow-down and poor utilization of the whole
machine. It may be more efficient to duplicate
parts of the distributed code in more than one
machine to save the time spent in message passing
and obtain a better overlap of computations on
different computers. This observation is the
basis of many techniques proposed in this paper
for designing distributed software for network
computers.

The problem to be solved in this paper is to
determine an appropriate technique for
distributed enumeration on a network computer.


The paper first discusses known methods for
designing enumeration schemes on a single
computer. Based on a comparison between these
techniques, Branch and Bound (BB) is then
chosen as a basis for designing a new distributed
enumeration algorithm. A method for analyzing
distributed algorithms, which takes into
consideration the speed-up factor
resulting from multiprocessing, processor
utilization, storage requirements, as well as the
communication overhead, is then presented. Two
types of communication overhead are identified
and methods for estimating each of them are
presented. On the basis of this analysis scheme,
the proposed distributed BB algorithm is
evaluated. An example for the use of the
technique is also given.


Enumeration Methods: 


Many variations of enumeration techniques have 
been designed for different kinds of problems. 
On the theoretical (abstract) level, enumeration 
methods, or backtracking algorithms as they are 
frequently called, are generally divided into two 
types: 
 - Methods for enumerating all the solutions
   of problems that have more than one feasible
   solution, and
 - Methods for finding the solution of
   problems that have only one correct
   solution.


Discrete optimization problems belong to the 
first kind. They are characterized by the 
existence of a cost function that is used to 
evaluate the cost of every feasible solution 
generated during enumeration. Frequently, this 
evaluation is done with the purpose of finding 
the minimum cost solution (the optimal solution). 


In this paper, our interest will be focused on
discrete optimization problems. However, the
algorithms used to solve these discrete
optimization problems can be extended without
much difficulty to include other types of
enumeration problems.


There are three well-known methods that were used 
to solve many discrete optimization problems 
using single computers. They are: 

Integer Programming 

Dynamic Programming (DP) 

Branch and Bound (BB) 


The following discussion examines the suitability
of each of these methods as a basis for a
distributed enumeration technique that works on 
network computers. 


Integer Programming: In integer programming a
set of decision variables is defined and the
problem is formulated as an objective function of
these variables to be minimized (or maximized)
and a set of constraints to be observed by the
generated values of these decision variables.
Commercial packages are available for solving
problems formulated in this way,
provided that the model is linear (Mixed Integer
Linear Programming, or MILP for short). Many
optimization problems (e.g. the partitioning
problem mentioned above) are, however, more
naturally expressed as nonlinear integer
programming problems. Most of them can be
linearized (see, for example, [El-Dessouki and
Huen, 1977] for a linear model of the
partitioning problem), but the resulting models
are ill-conditioned and their numerical stability
cannot be guaranteed. Furthermore, many
supplementary methods for pruning the search
space in MILP packages are actually BB
techniques. Hence, it is quite logical to
investigate how to distribute BB techniques
first. In addition to BB techniques, methods for
distributed matrix operations would be needed. It is
not clear at this point that these distributed
matrix operations can be designed without a
considerable amount of communication overhead.


Dynamic Programming: The underlying search
space for the enumeration algorithms can be
modeled as a finite tree of partial solutions.
The objective of these techniques is to locate
and identify an optimal solution while explicitly
searching only a small portion of the whole tree. As
Kohler and Steiglitz point out in [Coffman,
1976], Dynamic Programming and Branch and Bound
have recently emerged as the principal general
methods for finding solutions to discrete
optimization problems. DP essentially searches
breadth-first and uses dominance rules to prune
search tree nodes. The main advantage of DP is
that breadth-first search provides a global look
at the search space, allowing quick discovery and
elimination of dominated nodes. Breadth-first
searching has been the basis of enumeration
techniques designed for some multiprocessor
systems. For example, Marshall [1977] studied a
breadth-first enumeration technique for PEPE
(Parallel Element Processor Ensemble). There are
significant differences, however, between
multiprocessor machines and network computers, so
that distributed approaches adopted for
multiprocessors have serious drawbacks in network
computers. There are two main reasons for
this:
1. In a multiprocessor all the CPUs are assumed
to be capable of addressing all the memory in the
system without much difficulty. In a network
computer the speed of accessing information
heavily depends on whether this information is
located in the local memory of the computer or
not. Thus, information sharing may be achieved
relatively easily in a multiprocessor that
contains common memory, but it is extremely
inefficient in a network computer.


2. Most of the parallelism provided by a
multiprocessor, like PEPE, is on the arithmetic
and logic unit level (PEPE has 288 ALUs). In a
network computer, however, parallelism is
provided mainly on the program control unit (PCU)
level [Handler, 1977]. Every computer in the
network is capable of interpreting and executing
its own instruction stream. Thus, in a parallel
computer like PEPE or the ILLIAC IV for example,
it was very attractive to design algorithms in
the form of successive stages that consist of
many similar operations that are executed on
different data sets simultaneously by different
ALUs. Processes in this form of parallel
processing are executed by different ALUs in a
synchronous mode, performing one stage of
computation at a time and creating more processes
(sons) for the next stage. Consequently,
breadth-first was a very natural choice as a
basis for parallel enumeration on
multiprocessors. In network computers, however,
the amount of information that needs to be
communicated by every module to the process that
prunes the tree in every stage is huge and causes
a drastic slow-down in speed. Moreover, the size
of the generated partial solutions tree grows
exponentially before generating any solution.
Actually, solutions can only be generated if
enough memory is available to hold the entire
solution tree. This creates a serious problem
in a network computer in which the amount of
memory available in each computer is usually
small and always limited. The problem is that
when the search is stopped before reaching the
final stage, not a single solution can be
obtained and all the search effort is lost.


Branch and Bound: Branch and Bound is a 
general technique for backtrack enumeration that 
refers to a number of algorithms. The algorithms 
that are subsumed in the literature under this 
term, may differ widely in the ways used to 
arrange enumeration and the pruning of the search 
tree. The feature that distinguishes them from 
other enumeration techniques is the way they 
attempt to accumulate information about the 
optimal solution and the use of this information 
in pruning the search tree. Information 
regarding the most recently known best solution 
is accumulated in the form of upper bounds. A 
set of lower bounds on the expected quality of 
the solutions generated from active nodes is used 
to eliminate nodes of the search tree. When 
compared to DP, BB has many attractive features
that make it more suitable as a basis for
designing enumeration algorithms for network
computers. These features include:
1. BB tree search mechanisms are more 
flexible. They can be changed from 
breadth-first to depth-first or even a 
combination of both by choosing the 
appropriate branching policy. In fact, as
Steiglitz and Kohler point out in [Coffman,
1976], given a sequencing problem and a DP
algorithm to solve it, an equivalent BB
algorithm can also be found.

2. BB can generate complete solutions very
quickly, especially if a depth-first strategy
is used. These solutions may not be
optimal, but they are of known (bounded)
quality. More importantly, the quality of
the required solutions can be adjusted
according to the user's needs.

3. The tree search can be stopped as soon 
as the computational facilities (time or 
storage) are exhausted without losing the 
generated solutions. This property has a 
significant value in network computers that
have limited memory in each node and no 
virtual storage capability. 

4. The most important and attractive
feature of BB, however, is that the amount
of global information needed to describe the
state of the search is less than that in DP
and is adjustable. The state of the search
is described in each stage by the most
recent upper bounds and the lower bound on
the quality of the solutions generated from
the active node. It becomes natural to
think about a set of processes working
independently on different parts of the
search tree in an asynchronous mode and
exchanging information regarding the most
recent bounds. This style of loosely
coupled parallel computation is much more
suited to network computers.
To effectively compare the different BB 
algorithms, it is necessary to establish a 
general classification scheme. Following Kohler 
and Steiglitz [Coffman, 1976], BB algorithms can 
be classified according to nine parameters (B, S,
E, F, D, L, U, BR, RB). The nine parameters are
defined as follows:
1. B is the branching rule that defines the
scheme of generating the sons of each node in 
every stage of the algorithm. 
2. S is the selection rule that is used to 
select the next branching node from the set of 
currently active nodes. The selection rules that 
will be studied in this paper are: 
a. S = LLB (least lower bound rule) -
Select the currently active node with the
least lower-bound cost.
b. S = FIFO (first-in first-out rule) -
Select the currently active node that was
generated first. This leads to a
breadth-first strategy.
c. S = LIFO (last-in first-out rule) -
Select the currently active node that was
generated last, provided it is not a
completed solution (leaf). This gives a
depth-first search strategy.
d. S = DF/LLB (depth-first/least lower 
bound rule) - From the set of most recently 
generated active sons, select the son with 
the least lower bound cost, provided this 
son is not a completed solution (leaf). 
3. F is the characteristic function used to 
eliminate nodes known to have no completion in 
the set of feasible solutions. 
4. D is the dominance relation defined on the
set of partial solutions and used to eliminate 
nodes (dominated nodes) from the search tree 
before extending them. 
5. L is the lower bound function that assigns to 
each partial solution a real number representing 
a lower-bound cost for all complete solutions 
that can be generated from it. 
6. U is the upper-bound cost which is actually 
the cost of some complete solution known at the 
beginning of the algorithm. At each stage of the 
algorithm, U is updated to incorporate the least 
cost known solution up to that stage. 


7. E is the set of elimination rules that use D,
U, and L to eliminate newly generated and
currently active nodes. The most famous
elimination rules are:
a. U/DBAS (upper bound tested for dominance
of descendants of branching node and members
of currently active set) - If the lower
bound of a newly generated descendant of an
active node exceeds U, eliminate it before
it becomes active. If, as a result of this
operation, the lower bound of the active
node itself becomes greater than U,
eliminate that active node from the set of
currently active nodes (e.g. when all sons
have been investigated and additional and
improved information regarding L was known
as a result).
b. AS/DB (active node set tested for
dominance of descendants of branching node)
- Using D, each node in the active set is
tested for dominance of each descendant of
the branching node. If dominated, the
descendant is eliminated before being
considered active.
c. DB/AS (descendants of branching node 
tested for dominance of currently active 
node set) - Each descendant that passed the 
preceding test in b and became active is 
tested for dominance of each currently 
active node. Each dominated node is removed 
from the current active set of nodes. 
8. BR is a real number between zero and one 
representing the desired maximum relative 
deviation of the optimal cost from an acceptable 
solution. 
9. RB is the resource bound vector whose 
components are upper bounds on the total
expendable execution time and the usable storage 
for active nodes and immediate descendants of the 
branching nodes. 
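
The first three selection rules can be illustrated with a short sketch. The Node record and its field names are hypothetical, introduced here only for illustration, and the "not a completed solution" proviso of the LIFO rule is left out for brevity.

```python
from collections import namedtuple

# Hypothetical node record: 'order' is the generation sequence number and
# 'lower_bound' is the value of L for the partial solution.
Node = namedtuple("Node", ["name", "order", "lower_bound"])


def select_llb(active):
    """S = LLB: the active node with the least lower-bound cost."""
    return min(active, key=lambda n: n.lower_bound)


def select_fifo(active):
    """S = FIFO: the active node generated first (breadth-first)."""
    return min(active, key=lambda n: n.order)


def select_lifo(active):
    """S = LIFO: the active node generated last (depth-first)."""
    return max(active, key=lambda n: n.order)


active = [Node("a", 1, 7.0), Node("b", 2, 4.0), Node("c", 3, 9.0)]
```

With this active set, the three rules pick nodes "b", "a" and "c" respectively, which shows how the choice of S alone changes the order in which the tree is explored.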


Conceptually, in each stage of a BB algorithm, a 
"list" of the objects that can be permuted to 
generate all the possible alternatives for 
feasible solutions is constructed. The algorithm 
selects one "attractive" alternative using the
rule S. This operation can be considered as a
selection of one node of the search tree. Then
the algorithm generates the direct descendants of 
this node using B. It then computes L of these 
new nodes and uses D and E to detect and 
eliminate dominated nodes. It then updates the 
set of currently active nodes. When a complete 
solution (a leaf) is generated, U is updated and 
used together with BR to check whether an 
acceptable solution has been reached. The 
algorithm stops if such a solution is reached. 
In every stage the RB vector is used to determine
whether the available resources have been
exhausted. If so, the best solution known so far
is reported before halting.
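
The stages described above can be sketched for a small assignment problem. This is a minimal illustration under stated assumptions, not the algorithm of any particular BB package: the cost matrix, the function name branch_and_bound, and the particular lower bound are all hypothetical choices, and the D, F, BR and RB parameters are omitted.

```python
def branch_and_bound(cost):
    """Depth-first BB for a small assignment problem (illustrative only).

    cost[i][j] is the cost of giving job j to worker i. A partial
    solution is a tuple of the jobs already given to workers 0..k-1.
    """
    n = len(cost)

    def lower_bound(partial):
        # L: cost so far plus the cheapest unused job for each remaining worker.
        used = set(partial)
        so_far = sum(cost[i][j] for i, j in enumerate(partial))
        rest = sum(min(cost[i][j] for j in range(n) if j not in used)
                   for i in range(len(partial), n))
        return so_far + rest

    upper = float("inf")      # U: cost of the best complete solution so far
    best = None
    active = [()]             # currently active nodes; the root is empty
    while active:
        node = active.pop()   # S = LIFO, giving a depth-first search
        if lower_bound(node) >= upper:
            continue          # U may have improved since the node was stored
        for j in range(n):    # B: give the next worker each free job in turn
            if j in node:
                continue
            son = node + (j,)
            bound = lower_bound(son)
            if bound >= upper:
                continue      # E: eliminate sons that cannot improve on U
            if len(son) == n:
                upper, best = bound, son   # a leaf: update U
            else:
                active.append(son)
    return upper, best
```

The sketch eliminates a node when its lower bound equals U as well as when it exceeds U, a slightly stronger test than the rule stated in the text; it still returns an optimal solution because a node with bound equal to U cannot improve on the solution already held.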


From the above description it is clear that all 
BB methods generate the whole search tree at some 
point. This is unavoidable, otherwise the 
optimality of the generated solution cannot be 
guaranteed. This is another way of stating the 
fact that the problem at hand is NP-complete.
There are two desirable considerations, however,
that make one BB algorithm more attractive than
another for a network computer. First, because of
the limited memory in each node, the search tree
size should grow gradually (i.e. at a rate
comparable to the rate of generating solutions)
and the whole tree should be generated at the
latest possible time, after generating a number of
complete solutions. Second, the BB algorithm
should have low communication requirements.
These two features will be the basis for designing
a distributed BB algorithm in the next section.


Distributed Branch and Bound: 


One way to distribute the BB algorithm operations 
described above on a number of computers is to 
give every computer a "share" from the list of 
objects used to generate the solutions. Every 
computer, then, can start to investigate its own 
share. The most important point in the algorithm 
proposed here, however, is to let each computer 
by itself generate the list, identify its own 
share, and limit its investigation for a solution 
within this share. The computers, then, can 
exchange, via messages, global and time dependent 
information such as the most recent upper bound, 
in order to enhance the search for the optimal 
solution. It must be noticed that the process of 
generating at least a part of the required lists 
mentioned above is repeated in more than one 
computer. 


This scheme has two advantages. First, in a 
network computer environment, it is more 
efficient to let every computer search a complete 
subtree independent of the others. Even if
subtrees are not completely disjoint and some
computations are repeated in more than one
computer, the communication time saved still
makes this method more attractive and efficient
than having a single computer for each node in
the decision tree. As a quick example,
duplicating the computations corresponding to the
root of a decision tree in all the computers of
the network does not increase the load of
individual computers because each of them must
wait anyhow for the result of the root
calculation, but this scheme saves the
communication time required to send the result of
this computation to all the required receivers.
Second, the problem of letting a single computer 
take the decision of splitting up the load among 
a number of processors is avoided. This is 
particularly important in applications where it 
is difficult to determine how to divide the 
search problem before generating the actual
permutations. In these cases, some permutations 
can be ruled out very quickly as not feasible 
using the characteristic function F mentioned 
earlier, for example. But this can only happen 
after generating these permutations. In these 
cases the process of dividing the search load 
requires a significant amount of computations. 


In the proposed scheme this process is shared
among the computers of the network.

In order for a computer to be able to identify
its "share" in the enumeration, every computer
containing an enumeration subprocess is given an
identification number, MYNUMR. The enumeration
subprocess must also know the total number of
cooperating parallel subprocesses in the
enumeration job. This number is called
NCP.


During enumeration, each subprocess is in one of 
the following phases: 


Phase 1: Selection phase. 
Phase 2: Full enumeration phase. 
Phase 3: Exchange phase. 


In each phase, each subprocess operates as
follows:


Phase 1: Selection Phase: 

In this phase, the number of nodes in the search 
tree in each stage under consideration is less 
than the number of cooperating subprocesses NCP. 
Since every enumeration tree begins with one 
root, this is always the initial phase. 


At every stage each subprocess constructs the
list of objects that can be permuted to generate 
the partial solutions. The subprocess then 
selects one node from the decision tree by 
generating one specific permutation of these 
objects in the list. This same node may be 
chosen by more than one subprocess since the 
number of cooperating subprocesses is greater 
than the number of possible permutations. The 
distribution of nodes is made using a circular 
scheme such that if the number of nodes generated 
in a stage in the selection phase is denoted by 
SIZE, then subprocess numbered MYNUMR is assigned 
node number g where: 
g = [(MYNUMR - 1) modulo SIZE] + 1.


The subprocess calculates new values for MYNUMR
and NCP as follows:

DELTA = 1 if g ≤ [(NCP - 1) modulo SIZE] + 1
DELTA = 0 otherwise

MYNUMR = ⌈MYNUMR / SIZE⌉

NCP = ⌊NCP / SIZE⌋ + DELTA

Then the process reenters the selection phase at 
the next stage. 


This phase ends when SIZE becomes equal to or 
greater than NCP. The parameter SIZE, which 
indicates the number of possible permutations at 
each stage, can become larger than NCP at any 
stage of the selection phase. When this happens,
every process picks up a number of permutations
equal to ⌊SIZE/NCP⌋. The last process
arbitrarily gathers the remainder of the load 
when SIZE is not divisible by NCP. All processes 
then enter full enumeration. 
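
The circular distribution of the selection phase can be sketched as follows. The function names are hypothetical, and, as an assumption of this sketch, the number of subprocesses sharing a node is counted directly from the circular scheme instead of being evaluated through DELTA.

```python
def selection_step(mynumr, ncp, size):
    """One stage of the selection phase for a single subprocess.

    Returns (g, new_mynumr, new_ncp), where g is the 1-based index of the
    node chosen among the `size` nodes generated at this stage.
    """
    g = (mynumr - 1) % size + 1
    # Number of subprocesses 1..ncp that the circular scheme maps to node g.
    new_ncp = sum(1 for k in range(1, ncp + 1) if (k - 1) % size + 1 == g)
    # Rank of this subprocess within the group sharing node g
    # (ceiling of mynumr / size).
    new_mynumr = -(-mynumr // size)
    return g, new_mynumr, new_ncp


def final_share(mynumr, ncp, size):
    """Nodes (1-based) taken on entering full enumeration, when size >= ncp.

    Every subprocess takes size // ncp nodes; the last one also takes the
    remainder.
    """
    per = size // ncp
    start = (mynumr - 1) * per + 1
    end = size if mynumr == ncp else start + per - 1
    return list(range(start, end + 1))
```

For three subprocesses and a stage with two nodes, subprocesses 1 and 3 obtain g = 1 and continue as a group of two, while subprocess 2 obtains g = 2 and is alone. If the group of two then faces three nodes, the first member takes one node and the second takes the remaining two.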


Phase 2: Full Enumeration: 


When the number of alternative permutations of 
the generated list of objects reaches NCP, each 
subprocess can choose a permutation, being sure
that it is not shared with any other subprocess. 
Each subprocess can proceed to search the 
generated subtree in a normal BB fashion since it 
has a unique subtree of its own. 


Each computer in the network should include three 
tasks that cooperate to perform enumeration and 
communication with other processes in the other 
computers. These three tasks are called: 


The Main enumeration task, M, 
the Receiving task, R, and 
the Transmitting task, T. 


R receives the messages, updates the local 
information regarding enumeration (e.g. the most 
recent value of UB), and may activate special 
procedures in its node (e.g. a procedure to send 
information regarding unsearched parts of the 
local tree as will be explained in the exchange 
phase). T formats and sends the required
messages. 


In this section, different BB algorithms are 
examined using the nine-tuple characterization
presented earlier in order to compare them as a
basis for full enumeration in our distributed 
environment. The algorithms are analyzed on the 
basis of the amount of communication overhead, 
the effect of limited memory in each node, and 
the utilization of the parallel processing 
capability of the network. 


Information regarding the two parameters L, the 
lower bound, and U, the upper bound, will be
exchanged between the various nodes sharing in
enumeration. Consequently, these two parameters 
are assumed to be accessed and/or modified by the 
transmitting and receiving tasks, T and R, in 
each node in a way that will be discussed 
shortly. 


The four parameters: the branching rule B, the 
characteristic function F, the resource bound 
vector RB, and the bracket BR, are in fact user 
and/or problem dependent parameters. They can be 
incorporated in a direct way in each main task M
on each node. A simple modification for the
definition of these four parameters is that they
should all be defined on the local set of tree
nodes generated in each processor. Thus, for
example, B becomes the rule that is used to
generate the sons of an active node among those 
stored in the processor in which task M is 
executing. 


The main problem to be discussed now is the kind 
of S, D, and E rules that should be used and 
their effects on the communication overhead, 
speed, and storage. 


Two general types of S, D, and E rules can be 
proposed for the distributed algorithm: 


Type 1 : Global rules 
Type 2 : Local rules 


In the first approach, the parameters are defined
in the same way as in the uniprocessor case, e.g.
the dominance relation D is defined for all pairs
of generated solutions irrespective of the node
in which they are generated. A second example is
a global elimination rule in which partial
solutions are eliminated if they are dominated by
any partial or complete solution in any processor
in the network. In the second approach, the
rules are defined to operate on the partial
and/or completed solutions generated within the
processor only (local solutions). As an example,
a local selection rule would be to select the
node with the least lower bound cost among the
local set of active nodes in each computer.


To compare the two approaches, it is first 
observed that a global selection rule has no 
advantage over a local one. A global selection
rule may let only one computer work at a time (the
one which has the "selected" node), resulting in
very poor utilization of the parallelism in the
network. Another alternative for global selection
is to select a set of nodes and send information 
regarding these nodes to a set of computers to 
allow them to search in parallel. This may well 
lead to a situation in which the information 
regarding the whole enumeration tree is sent many 
times around the network causing a prohibitive 
communication overhead. Thus, a local selection 
rule as the one mentioned earlier is a clear 
choice. 


The advantage of a global dominance relation and
an elimination rule based on it is that it may
increase the probability of discovering dominated
solutions earlier in the enumeration and hence
reduce computation time. However, this is
not necessarily the case. For every global dominance
relation, D, a local dominance relation, D', can
be found such that D' ⊆ D. This can be achieved
by using the same definition of D on the
restricted local set of solutions. Kohler and
Steiglitz [1976] proved that using a stronger
dominance relation like D instead of D' may
increase the computational requirements of a BB
algorithm (theorem 6.3 p 256 of [Coffman, 1976]).
Moreover, the communication overhead incurred in
exchanging the information needed to test every
pair of solutions will significantly slow down
the resulting algorithm even further. As a
result, the local approach is chosen as a basis to
design the distributed BB algorithm in this
paper. Thus, the dominance relation and all the
elimination rules based on it are defined for the
local set of solutions in each computer.

A final point in this comparison is the
difference between various selection rules. A
significant advantage of a local LIFO
(depth-first) and/or DF/LLB (depth-first least
lower bound) rule over a local FIFO
(breadth-first) and/or LLB rule in a network is
that with the first set of rules many complete
solutions (one in each computer) are generated
quickly. Since the value of the most recent UB
found in each computer is exchanged among all the
computers via messages, this should result in
increasing the chance of improving the UB value
at each stage. This in effect can be thought of
as using a tighter UB in each computer which, as
indicated in the studies of uniprocessor BB
algorithms, is a significant factor in speeding
up enumeration. Moreover, in depth-first search,
the UB contains all the information that
needs to be retrieved during successive stages
of the enumeration. The storage space of
complete solutions can be reused in generating
new solutions. The full specification of the
best known complete solution is not needed during
enumeration and can be stored on a secondary
storage file, for example. This fact has a
special significance in network computers with
limited memory in each node.


Thus, in the distributed BB algorithm proposed in 
this paper the main process M in each computer 
searches its own subtree using a depth-first 
selection rule. When a complete solution is 
generated the UB in that computer is updated and 
the Transmit task, T, is activated to "broadcast" 
the upper bound value to all the other computers. 
When a message concerning a new value of UB is
received in any computer, the Receiving task, R,
is activated. It tests the received UB value and
uses it to update the local variable that
contains the most recently known UB. It should
be noticed that the location of that local UB is
accessed by both M and R asynchronously.
Consequently, this location must be considered a 
critical resource and appropriate mechanisms for 
mutual exclusion should be used. 
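
A minimal sketch of such a mutual exclusion mechanism is given below, using threads in one program to stand in for the tasks M and R. The class name SharedUpperBound and its methods are hypothetical and serve only to illustrate the idea; the message handling of R and T is not modeled.

```python
import threading


class SharedUpperBound:
    """Local copy of UB, shared by the main task M and the receiving task R."""

    def __init__(self):
        self._value = float("inf")
        self._lock = threading.Lock()

    def offer(self, candidate):
        """Install `candidate` if it improves UB; return True if it did."""
        # The test and the update form one critical section, so a better
        # value written by the other task cannot be overwritten by a worse one.
        with self._lock:
            if candidate < self._value:
                self._value = candidate
                return True
            return False

    def get(self):
        """Return the most recently known UB."""
        with self._lock:
            return self._value
```

In this sketch M would call offer() whenever it completes a solution and broadcast the value only when offer() returns True, while R would call offer() with each received value. Both tasks read the bound through get() when pruning.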


Phase 3: Exchange Phase: 

The exchange phase is entered when an enumeration 
process p completes the search for a solution in 
its own subtree and terminates its active phase 
of full enumeration, while some other 
subprocesses are still searching their subtrees. 


Process p declares its new status and initiates a
dialogue aiming at redistributing the remaining
enumeration load. Essentially, process p
switches to search a new, as yet unsearched
subtree that was originally a part of the load of
another process q. In order to achieve this, the
following steps must be taken:


1. Process p must identify an appropriate 
process q to share the enumeration load 
with. 

2. Each process in the system must
identify the unsearched part of its
subtree.

3. The information necessary to switch
from one subtree to another must be
specified. Process q sends a message to
process p including such information.

4. Process p should carry most of the
overhead load associated with the
switching operation, since it is the idle
process.

5. Process q must identify the subtree
given to process p and update its search
space properly by eliminating this subtree
from it.

6. A "thrashing" situation in which the
processes spend most of the time
exchanging subtrees should be avoided.


As was mentioned before, each process at the
beginning of the selection phase knows its own name,
MYNUMR, and the number of computers in the
network, NCP. The initial values of these two
parameters are called MYNUMRO and NCPO,
respectively. Every process in the system is
assumed to be identified by its MYNUMRO. Thus,
any message in the system should be associated
with two parameters, the values of MYNUMRO of the
sender and the receiver. This identification
should act as a channel number for the
communication network. Broadcast messages
mentioned before, however, are treated specially
and may not include this kind of information.
(It must be noticed that the values of NCP and
MYNUMR are updated in every stage of the
selection phase. Consequently, their final
values will, in general, be different from NCPO
and MYNUMRO.) Hence, a process p can identify
its nearest neighbor, p*, as:

p* = [(MYNUMRO + 1) modulo NCPO]

Note that this scheme makes process 1 in the
enumeration the successor of process number NCPO.


When process p enters the exchange phase, it 
starts asking the other processes in the system 
one at a time beginning with p* for a share of 
their enumeration load. If p* does not want to 
give any part of its load, process p asks the 
nearest neighbor of p* and so on. If no process 
wants to give any part of its load, process p 
informs the group that it is quitting and it 
halts. Otherwise the first process that shows 
willingness for exchange is identified as q. 
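
The polling order can be sketched as follows. The function names and the is_willing callback are hypothetical. As an assumption of this sketch, the successor of process k is computed in the wrap-around form (k modulo n) + 1, chosen so that process 1 follows process n and every result stays in the range 1..n.

```python
def polling_order(p, ncp0):
    """Processes that p asks for load, nearest neighbor first.

    Processes are numbered 1..ncp0 by their initial identification number.
    """
    order = []
    k = p % ncp0 + 1          # nearest neighbor of p
    while k != p:
        order.append(k)
        k = k % ncp0 + 1      # neighbor of the process just asked
    return order


def find_donor(p, ncp0, is_willing):
    """Return the first process willing to share load, or None.

    `is_willing` stands in for the request/reply message exchange; a
    result of None means that p informs the group and halts.
    """
    for candidate in polling_order(p, ncp0):
        if is_willing(candidate):
            return candidate
    return None
```

With three processes, process 1 asks 2 and then 3, and process 3 asks 1 and then 2, so each idle process visits every other process at most once before halting.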


The next step is to design the mechanism by which 
each process can identify the unsearched part of 
its own subtree. To do that, it must be noticed 
that phase 2 incorporates a depth-first search 
strategy. Moreover, one of its main 
characteristics is that each enumeration 
subprocess is searching a complete subtree. In 
order to maintain these desirable features, 
process p should switch to the upper-most node
(i.e. the one closest to the root) in the unsearched
part of q's subtree. Furthermore, p should pick 
up the subtree that will be searched last by q. 
To achieve this while avoiding the occurrence of 
a thrashing situation the following steps are 
executed: 
1. Each process in the system (like q) 
maintains a pointer, UNSEARCH, to the last 
son of the upper-most root of its unsearched 
subtree. UNSEARCH can be initialized at the 
point of entering full enumeration. Two 
cases must be considered at the last stage 
of the selection phase. If SIZE became 
greater than NCP and the share of each 
computer at the moment of entering full 
enumeration is greater than one tree node 
(i.e. when ⌊SIZE/NCP⌋ > 1), UNSEARCH should
be initialized to point to the last node of
each computer's share. Otherwise, UNSEARCH
should be initialized to point to the last 
"son" of the only node picked by the 
computer while entering full enumeration. 
2. When a process begins the search in the 
subtree with root UNSEARCH, the value of 
UNSEARCH should be updated to point to its 
last son. 
3. When a process receives a message asking
for a share of its enumeration load, it tests
whether the node pointed to by UNSEARCH has
brothers that are active but not yet
branched from. If there are, the process
sends the address of UNSEARCH to the
requesting process p (i.e. it becomes q).
If there are no brothers for the node
UNSEARCH, however, the process initiates a
branching from UNSEARCH; if more than one son
results from the branching, it passes the
address of the last son to p. Otherwise,
the process expresses its refusal for
exchange (the remaining load is too little
and it is better not to send it around).
UNSEARCH is updated to point to the last son
remaining in the processor.
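
Step 3 can be sketched as a single decision function. This is one possible reading of the step, not a definitive implementation: the function name, its arguments, the assumption that the brothers are listed in generation order, and the choice to leave UNSEARCH unchanged on refusal are all assumptions made for illustration.

```python
def handle_share_request(unsearch, free_brothers, branch):
    """Reaction of a process to a request for a share of its load.

    unsearch      -- node currently pointed to by UNSEARCH
    free_brothers -- brothers of `unsearch` that are active but not yet
                     branched from, in generation order (assumed)
    branch        -- function returning the list of sons of a node

    Returns (donated, new_unsearch); `donated` is None on refusal.
    """
    if free_brothers:
        # Give the UNSEARCH node away and keep the brothers; UNSEARCH
        # moves to the last brother remaining in this processor.
        return unsearch, free_brothers[-1]
    sons = branch(unsearch)
    if len(sons) > 1:
        # Give the last son away; UNSEARCH moves to the last son kept.
        return sons[-1], sons[-2]
    # Too little load remains: refuse (UNSEARCH left unchanged here).
    return None, unsearch
```

In both accepting cases the donor keeps at least one node of the unsearched part, which matches the aim of avoiding a second exchange immediately after the first.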


The advantage of this scheme is that the free
computer, p, takes a share of the load
proportional to the remaining load in q. Also,
process q always keeps a part of that remaining
load in order to avoid another exchange in case q
is about to enter the unsearched part of its tree
when it receives the exchange message from p. In
this way unnecessary exchanges of tree nodes are
substantially reduced. The next step is to
specify what information must be exchanged
between p and q before a new active phase can
begin in p. Since each process in the network
has a copy of the original set of objects that
can be permuted to obtain partial solutions at
each stage, process p can resume enumeration at
any node of the search tree if it is given the
description of the partial solution corresponding
to that node. This information must be supplied
by q. After sending this information to p,
process q eliminates that partial solution from
its search space and updates UNSEARCH as indicated
above.


An example for distributing a simple enumeration 
tree on 3 processors using the above scheme is 
given below. 


Example: The original tree is shown in Figure A. 
In the first step of the selection phase each 
process generates the initial root node, 
identifies its neighbor, and performs the _ root 
computations as shown in Figure B. 

In the second iteration of the selection phase, 
the alternatives examined in each process and the 
computation of g are shown in Figure C. 

Based on the value of g, processes Pl and P3 
"select" node 2, and process P2 "selects" node 3 
as shown in Figure D. The values of MYNUMR and 
NCP are updated accordingly. 

In the next iteration, process P2 realizes that 
no other process is sharing with it node 3 (NCP 
became 1 which is less than or equal to the 
number of entries in the list {3} selected by 
process P2 at this point). Thus, P2 enters full 
enumeration phase and identifies node 9 as the
uppermost root of the "unsearched" part of its
tree. Processes P1 and P3, however, begin a new
iteration in the selection phase as shown in 
Figure E. 


At this point, both processes P1 and P3 realize
that the number of nodes in {4,5,6} is 3 and P1
"picks-up" node 4 while P3 "picks-up" 5 and 6.
Each process stores a value for UNSEARCH and
enters full enumeration phase as shown in Figure
F.

Each process continues full enumeration of its 
subtree until, for the purpose of completing this 
example, a point is reached when P1 finishes its
full enumeration phase and wants to share in the
remaining enumeration load. The status of the
enumeration is assumed to be as shown in Figure
G.

Process P1 recognizes that 2 is the process
number of its "neighbor". Hence, P1 initiates
an exchange phase with P2. After the exchange is 
complete, the new status of the enumeration is 
shown in Figure H. 


It is clear that in a real application the trees 
are much bigger than the one shown in the example 
and one or more exchange phases may be initiated 
by various processes in the system. 


Algorithm Analysis: 


In this section, the performance of the
distributed BB algorithm presented above is
analysed. In recent research ([Agerwala and
Lint, 1978], [Baudet and Stevenson, 1978] and
others), it has been pointed out that the
previous work on the design of parallel
algorithms for SIMD machines largely ignored
communication delays. New methods for analyzing
algorithms for machines in which communication
overhead cannot be ignored are beginning to
appear in recent research. The following
analysis considers the computational complexity
and the communication overhead of the distributed
enumeration algorithm.


Before examining the algorithm, the computation 
model on which the analysis is based must be 
described. As was mentioned earlier, the system 
architecture is a network computer consisting of 
multiple processors each having its own local 
memory and control. There is no shared memory in 
the system. Computers are connected by a fixed 
topology network. The time to send a unit of 
information between two computers is proportional 
to the distance between them (i.e. the number of 
links or data movements required between them). 
A linear cost criterion [Agerwala and Lint, 1978] 
is used to define message length in this paper. 
Under this criterion a message can be considered 
of length equal to an integral multiple of some 
basic unit. It is further assumed that one 
processor can broadcast one unit of information 
to all the processors of the network. The time 
to send such a broadcast message is proportional 
to the number of computers in the network. 
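
The cost model just described can be written down as two small functions. This is a sketch under stated assumptions: the constant of proportionality is taken to be a single parameter `unit_time`, and the function names are hypothetical.

```python
def point_to_point_time(distance, length_units, unit_time=1.0):
    """Time to send a message of `length_units` basic units over `distance` links.

    Assumption: time is linear in both the distance and the message length.
    """
    return unit_time * distance * length_units

def broadcast_time(num_computers, unit_time=1.0):
    """Time to broadcast one unit of information to the whole network."""
    return unit_time * num_computers

# A 3-unit message between computers that are 2 links apart.
example_cost = point_to_point_time(distance=2, length_units=3)
```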


In the distributed BB algorithm presented above,
processors are assumed to execute in an
asynchronous manner. Three modes of interaction 
among computation and communication can be 
directly identified: 

1. In the selection phase each processor
executes its computation using local data only.
No interaction or communication of any kind
between different computers takes place during
this phase.

2. In the full enumeration phase, each
processor's main enumeration task M executes its
computations in the form of a loop. In each
iteration (stage) of the loop, selection,
branching, and elimination steps are executed.
The main task does not wait for messages from any
other processor. When a message is received,
however, the receiving task R in that processor
is activated. As a result, tasks R and M share
the same CPU until the message is processed.
Also, when a message needs to be sent, task T is
activated and it takes its share of the
processor time. It is therefore clear that the
communication delay in this case overlaps with
computation completely. The communication
overhead is directly proportional to the
processing time of T and R. For the purpose of
this paper, it is convenient to measure this kind
of overhead as:

NT*YT + NR*YR

where NT is the number of times T is activated,
YT is the amount of processing done by T
each time it is activated,
NR is the number of times R is activated,
YR is the amount of processing done by R
each time it is activated.

3. In the exchange phase, the processes involved
in the exchange operation can be modeled using a
technique similar to that suggested in [Agerwala and
Lint, 1978], i.e. each process is assumed to be a
sequence of non-overlapping cycles of computation
followed by communication. The communication in
this case is carried out by the network and is
modeled using the linear model discussed at the
beginning of this section.


With this model defined, the distributed BB can 
be discussed. Given a network of N computers and 
an enumeration problem in which the number of 
possible solutions in stage i is given by
SIZE(i), the selection phase ends at stage j at
which:

$$\mathrm{SIZE}(j) > N$$

and the number of tree nodes generated in the
selection phase is:

$$(N+1) \le \sum_{i=1}^{j} \mathrm{SIZE}(i) \le 2N$$

In this phase each computer is working to 
BSRS ease khs, pata Bei oft § Such bh erMe 1g27g8. 


than or equal to $\lceil \log N \rceil$.

Thus, the speed-up obtained in this phase is
given by:

$$S > N / \lceil \log N \rceil$$
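
The end of the selection phase can be illustrated numerically. The sketch below assumes a caller-supplied function `size(i)` giving the number of possible solutions in stage i, a base-2 logarithm, and N of at least 2; the names are hypothetical.

```python
import math

def selection_phase(size, n_computers):
    """Return (j, nodes): the first stage j with size(j) > N, and the
    number of tree nodes generated in stages 1..j."""
    j = 1
    nodes = size(1)
    while size(j) <= n_computers:
        j += 1
        nodes += size(j)
    return j, nodes

def selection_speedup_bound(n_computers):
    """Bound N / ceil(log2 N); base 2 is an assumption, and N must be >= 2."""
    return n_computers / math.ceil(math.log2(n_computers))

# Binary enumeration tree (size(i) = 2**i) on 3 computers.
j, nodes = selection_phase(lambda i: 2 ** i, 3)
```

For this binary tree the node count falls between N+1 and 2N, as the bound above states.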
In the full enumeration phase, the number of
solutions generated in each computer, Q, can be
bounded by:

$$1 \le Q \le \mathrm{SIZE}(\mathrm{FINAL})/N$$

where FINAL is the number of stages in the whole
enumeration tree. Thus, the number of messages
transmitted by each node is Q and the number of
messages received is less than (N-1)Q. This is
in fact a very pessimistic estimate on the number 
of messages sent and received in every node. It 
is assumed that the computer that generates the
worst solution is the fastest one and the
computer with the next worst solution is the
second fastest and so on. Many of these messages
in a real case will not be sent since some of the
good solutions may be arrived at before the bad
ones. The probability of obtaining good
solutions first is better in the case of a
network of independent computers working on
different subtrees simultaneously (as in the
above distributed BB algorithm) than in the case of
a single processor. This probability also
increases with the number of processors N. A
better estimate for the number of received
messages may therefore be NqQ, where q is a
fraction between 0 and 1.

Each received message causes UB to be updated.
This operation should not take more than O(1)
time units. Also, sending a message with a new
UB is another O(1) operation. Consequently, the
amount of communication processing is bounded by
O(NqQ). In order to estimate the amount of
processing done by the main enumeration task M in
every stage, it is assumed that the number of
active tasks to be considered for dominance and
elimination processing using E and D is
proportional to the number of branching
operations done up to that stage. Thus, the
processing done by the main enumeration task in
every stage can be bounded by:

$$\lceil \log(\mathrm{SIZE}(\mathrm{FINAL})/N) \rceil^{c}$$

where $c \ge 1$ (depending on the sophistication of
the elimination rule used, c may vary between 1
and 3 mostly). Consequently, the total amount of
processing in all the stages done by the main
task M may be estimated as:

$$\lceil \log(\mathrm{SIZE}(\mathrm{FINAL})/N) \rceil^{c} \cdot 2 \cdot \mathrm{SIZE}(\mathrm{FINAL})/N$$

The factor 2*SIZE(FINAL)/N represents the upper
bound on the number of tree nodes visited by one
computer, i.e. the number of branching operations
executed.

For a uniprocessor the amount of processing
should be

$$\lceil \log \mathrm{SIZE}(\mathrm{FINAL}) \rceil^{c} \cdot \mathrm{SIZE}(\mathrm{FINAL})$$

Thus, the speed-up in the full enumeration phase is
given by (calling SIZE(FINAL) SF):

$$S > N \Big/ \left[ \left(1 - \frac{\log N}{\log SF}\right)^{c} + \frac{qN}{(\log SF)^{c}} \right]$$

For the case c=q=1, this expression reduces to:

$$N \Big/ \left[ 1 + \frac{N - \log N}{\log \mathrm{SIZE}(\mathrm{FINAL})} \right]$$
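
The behaviour of the speed-up bound can be explored numerically. The sketch below assumes that the bound has the form N / [(1 - log N / log SF)^c + qN / (log SF)^c] with base-2 logarithms; both the reading of the second term and the base are assumptions, and the function names are hypothetical.

```python
import math

def speedup_bound(n, sf, c=1.0, q=1.0):
    """Full-enumeration speed-up bound for n computers and SF = SIZE(FINAL).

    Assumed form: n / ((1 - log n / log sf)**c + q*n / (log sf)**c).
    """
    log_sf = math.log2(sf)
    return n / ((1 - math.log2(n) / log_sf) ** c + q * n / log_sf ** c)

def speedup_bound_c1_q1(n, sf):
    """Special case c = q = 1: n / (1 + (n - log n) / log sf)."""
    return n / (1 + (n - math.log2(n)) / math.log2(sf))

moderate_tree = speedup_bound(8, 2 ** 40)
huge_tree = speedup_bound(8, 2 ** 4000)
```

Under this reading the bound approaches N as the tree grows, which matches the claim made in the Conclusions for trees of huge size.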


In the exchange phase, every computer begins by
asking its nearest neighbor for a share in
enumeration. If it is considered very rare for a
neighbor to refuse an exchange, then the distance
of all the messages in this phase can be
considered unity. As was mentioned before in the
description of the exchange protocol, process q
may have to execute a branch operation and an
elimination operation for one node. Also, the
amount of exchanged information is essentially
equal to the size of one partial solution. All
these operations take an amount of time
independent of the number of computers in the
network and the number of nodes in the
enumeration tree. Consequently, the overhead
associated with the exchange will be bounded by
the number of exchange operations that occur. If
the amount of processing needed in each
enumeration subtree is roughly the same, the
number of exchange operations should be small.
However, it will greatly depend on the
distribution of actual load on the computers, a
factor that not only varies on a statistical
basis but also depends on the nature of the
problem.


Conclusions: 


A distributed enumeration algorithm has been 
presented that has the following characteristics: 
1. It consists of a number of loosely coupled 
processes that operate in parallel. Each process 
enumerates a complete subtree of the enumeration 
space. A depth-first BB technique is used to 
enumerate all feasible solutions. 


2. The information exchanged between various
processes is reduced to one value per complete
solution (the most recent upper bound), to
minimize the communication overhead. This was
shown to be the most attractive feature of
depth-first BB as compared to all other
enumeration techniques. The communication
overhead is reduced in some phases of the
algorithm by duplicating some of the computations
in more than one process. These
duplicated computations do not increase the load
of any process sharing in the enumeration.

3. The technique is general enough to be applied
in many enumeration problems and in
multiprocessors as well as network computers.
The reduced communication overhead makes it
especially attractive for the network computer
case. In a multiprocessor with interleaved
memory, this feature can be used to reduce memory
conflicts.

A method for analyzing distributed algorithms was 
discussed and used to analyze the performance of 
the proposed algorithm. It was shown that the 
nature of the communication overhead can vary 
depending on the way asynchronous processes are 
designed to interact together. One case was 
studied in which the amount of communication 
overhead does not depend on the characteristics 
of the communication network but rather on the 
complexity of some message handling task in each 
computer of the network. Finally, it was shown
that the speed-up gained by the proposed
algorithm can reach the number of computers in
the network for trees with huge sizes.
THE MULTISENSOR DATA CORRELATION AND HANDOVER PROBLEM
Huntsville, Alabama 35805


Summary 


The situation in which N items (men, radar 
tracks, jobs) must be assigned or paired with N 
other items (tasks, incoming objects, computers) so 
as to minimize some measure of the pairing is called 
the Assignment Problem (AP) [1,2]. An algorithm to 
determine the optimal pairing is well-known and has 
been studied extensively. The algorithm is a modi- 
fication of a branch-and-bound tree search and takes 
advantage of the special structure of the problem. 


The multisensor data correlation and handover 
problem occurs when object tracks pass from one 
radar's field of view to another's. In order to 
provide the maximum information transfer, the two 
radars must be able to identify the same objects 
when observed from two different aspects. This is 
an assignment problem. 


The assignment algorithm is a sequential
process, and execution times tend to be on the order
of $N^3$ [3]. For a large number of objects, such as
in the ballistic missile defense problem, algorithm
execution times are excessive, and means to obtain
solutions quickly must be determined. Several ways
to reduce the AP times have been explored. One
approach is to use a non-optimal heuristic [4].
Execution times for the heuristic are on the order
of $N^2$. However, the number of incorrect assignments
made makes this approach unacceptable.


Another way to reduce the AP time is to take 
advantage of the parallelism found in the sequential 
algorithm. A study was conducted using the Parallel 
Element Processing Ensemble (PEPE) to determine any 
advantages found by solving the AP on a parallel 
associative processor. A testbed of approximately 
50 problems was generated. These problems ranged in 
size from 3 to 12 objects. Included for each size 
were several randomly generated problems and the 
sequential algorithm "worst case" example. The
worst case runs in N* time. 


These test problems were then solved by four 
different means to compare the relative solution 
times and solution accuracies. The methods used 
were (1) the sequential algorithm, (2) the sequential 
heuristic, (3) the parallel algorithm, and (4) the 
parallel form of the heuristic. Test results for 
execution times are shown in Figure 1.


As reported in the literature, the execution
time for the sequential algorithm is a function of
$N^3$. The parallel form of the optimal algorithm
gives an execution time proportional to $N^2$. The
sequential heuristic runs in $N^2$ time while the
parallel heuristic runs in a time proportional to N.


[Figure 1. Test Results: solution time versus number of objects
for the sequential optimal, sequential heuristic, parallel
optimal, and parallel heuristic methods.]

The sequential heuristic was found to produce 
a large number of errors in the pairings. In 
addition, the times for the heuristic were longer 
than those for the parallel algorithm even though 
both ran in $N^2$ time. Thus, the parallel algorithm
provided the optimal solution while execution
times were reduced by a factor of N when compared
with the sequential implementation. In a critical
environment, such an improvement could prove to be
decisive.
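
The difference between an optimal pairing and a heuristic one can be seen on a tiny instance. The sketch below is illustrative only: the optimal solver is an exhaustive search over all pairings (not the $N^3$ assignment algorithm referred to above), and the row-by-row greedy rule is a stand-in chosen for illustration, not the heuristic of [4]. All names are hypothetical.

```python
from itertools import permutations

def total_cost(cost, pairing):
    """Sum of cost[i][pairing[i]] over all rows i."""
    return sum(cost[i][j] for i, j in enumerate(pairing))

def optimal_assignment(cost):
    """Exhaustive search over all N! pairings (illustration only)."""
    n = len(cost)
    best = min(permutations(range(n)), key=lambda p: total_cost(cost, p))
    return list(best), total_cost(cost, best)

def greedy_assignment(cost):
    """Assumed stand-in heuristic: each row takes its cheapest free column."""
    free = set(range(len(cost)))
    pairing = []
    for row in cost:
        j = min(free, key=lambda col: row[col])
        free.remove(j)
        pairing.append(j)
    return pairing, total_cost(cost, pairing)

cost = [[1, 2],
        [1, 100]]
best = optimal_assignment(cost)
greedy = greedy_assignment(cost)
```

On this instance the greedy rule commits the first row to its cheapest column and forces a very expensive second pairing, which is the kind of incorrect assignment that makes a non-optimal heuristic unattractive.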
A FRAMEWORK FOR THE STUDY OF PERMUTATIONS AND APPLICATIONS
TO MEMORY PROCESSOR INTERCONNECTION NETWORKS*


D. K. Pradhan and K. L. Kodandapani
School of Engineering
Rochester, Michigan 48063


Abstract--This paper develops a set of syste- 
matic techniques to facilitate the study of permu- 
tations and permutation networks. First, we pre- 
sent a switching theory framework through which 
certain well-known permutations that are useful in 
parallel processing are characterized. A procedure 
is then developed, using these results, which is 
useful in characterizing certain shuffle-exchange 
networks and, in general, any permutation networks. 
This procedure is based on a technique whereby 
complex permutations are decomposed into a se- 
quence of elementary permutations. Several new 
results are thus derived. 


Introduction 


A goal of this paper is to extend, as well as 
unify, the understanding of both permutations and 
permutation networks. It is hoped that these new 
results and insights may lead to further research. 

The study of permutations and permutation 
networks has been an important topic of research 
in parallel processing [3-8]. Permutations of 
the data, as well as of the intermediate results, 
are required in order to execute the algorithms 
that are used in parallel processing. Also, the 
ability to simultaneously access multiple data
elements from memory is key to successful parallel
processing. This simultaneous access is achieved
by the use of multiple memory modules [1], where
those data items that may be simultaneously needed
are stored in different modules. Several permutation
techniques [3-8] have been proposed for arranging
the data items so that conflict-free access
to the memory modules is achieved.

In the literature, certain permutations, such 
as uniform shift, unscrambling of t-ordered vec- 
tors, etc., have been identified as important in 
parallel processing (especially in SIMD-type ma- 
chines). These findings have influenced the de- 
sign of some of the permutation networks that have 
been developed for interconnection between memory 
modules and processors [3-8]. 

In this paper, we first introduce a switching 
theory framework of permutations. Then, using this 
formulation, we derive some interesting results 
that characterize some of the permutations that 
are known to be useful in parallel processing. We 
then develop a technique that characterizes cer- 
tain shuffle-exchange networks and illustrate how 
this technique can be used to study any arbitrary
permutation network. This technique is based on 
the process where a complex permutation (which is 
realized by a network in one or more passes) is 
decomposed into a sequence of elementary permuta- 
tions. 

First, we study the class of permutations 
which is admitted by the shuffle-exchange network 
proposed by Lawrie [3]. We then derive results 
which provide new insight into these permutations. 
Next, we focus our attention on the simplified 
version of the shuffle-exchange network proposed, 
subsequently, by Lang-Stone [4]. We assume a more 
general condition for these simplified networks 
and allow for all possible combinations of con- 
trol functions for different passes through the 
network. Under this generalized condition, we 
derive results which exhibit certain characteris- 
tics of the permutations which are admitted by 
these networks. Furthermore, a tighter upper 
bound on the number of permutations admitted by 
the network is derived. 

This paper is organized into two main sec- 
tions. In the following section, a switching 
theory formulation of permutations is presented. 
This is useful in the second section where a pro- 
cedure for characterizing memory processor inter- 
connection networks is developed. 


A Technique for Characterizing Permutations 
and Its Related Results 


A permutation p can be defined as a one-to-one
and onto mapping from a set of integers into
itself. The permutation p is usually represented
as $\{(i, p(i)) \mid 0 \le i \le N-1\}$ where p(i)
represents the mapping of i, $0 \le i \le N-1$.

In this paper, as in some of the previous
work [3-8], we assume $N = 2^n$ for some n. In
the following, we introduce F, an alternate
representation of p.

Consider the binary numbers 0 to N-1. Let
$B^n$ represent these binary numbers. The set $B^n$
consists of $2^n$ distinct binary n-tuples. Let
$F: B^n \to B^n$ be a mapping, as defined below.

For each i, $0 \le i \le N-1$, let
$i = (i_n, i_{n-1}, \ldots, i_1)$ denote the binary number i. Let
$F(i) = j$ where $j = (j_n, j_{n-1}, \ldots, j_k, \ldots, j_1)$
denotes the binary number j, where j = p(i).
Thus, F is a one-to-one and onto mapping of $B^n$
into $B^n$ and is the permutation p expressed in
terms of the binary numbers.

Now F, in turn, can be expressed as a collection
of n switching functions $f_1, f_2, \ldots, f_n$.
Each of these functions $f_k$, $1 \le k \le n$,
is an n-variable function, as defined below.

For $1 \le k \le n$, let

$$Y_k = f_k(y_n, y_{n-1}, \ldots, y_1)$$

and let

$$j_k = f_k(y_n = i_n, y_{n-1} = i_{n-1}, \ldots, y_1 = i_1)$$

where $j_k$ is the kth component of the binary number
j, given by F(i) = j.
Example 1.

Consider the following permutation.

  i   p(i)    y3 y2 y1    Y3 Y2 Y1
  0    0       0  0  0     0  0  0
  1    5       0  0  1     1  0  1
  2    6       0  1  0     1  1  0
  3    7       0  1  1     1  1  1
  4    1       1  0  0     0  0  1
  5    2       1  0  1     0  1  0
  6    3       1  1  0     0  1  1
  7    4       1  1  1     1  0  0

$$Y_3 = \bar{y}_3 \bar{y}_2 y_1 + \bar{y}_3 y_2 + y_3 y_2 y_1$$

$$Y_2 = \bar{y}_3 y_2 + y_3 \bar{y}_2 y_1 + y_3 y_2 \bar{y}_1$$

$$Y_1 = \bar{y}_3 y_1 + y_3 \bar{y}_1$$

Thus, we see that every permutation on $2^n$
elements can be represented in terms of n switching
functions.
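
The passage from a permutation to its n switching functions is mechanical, as the following sketch suggests. It assumes that $y_1$ and $Y_1$ are the least significant bits; the helper name `switching_functions` is hypothetical.

```python
def switching_functions(perm, n):
    """Return {k: set of inputs i for which Y_k = 1}, for k = 1 (LSB) .. n (MSB)."""
    return {
        k: {i for i in range(2 ** n) if (perm[i] >> (k - 1)) & 1}
        for k in range(1, n + 1)
    }

# The permutation of Example 1 on N = 8 elements.
perm = [0, 5, 6, 7, 1, 2, 3, 4]
Y = switching_functions(perm, 3)

def sum_of_products(i):
    """Evaluate the three sum-of-products expressions of Example 1 at input i."""
    y1, y2, y3 = i & 1, (i >> 1) & 1, (i >> 2) & 1
    n1, n2, n3 = 1 - y1, 1 - y2, 1 - y3  # complemented variables
    Y3 = (n3 & n2 & y1) | (n3 & y2) | (y3 & y2 & y1)
    Y2 = (n3 & y2) | (y3 & n2 & y1) | (y3 & y2 & n1)
    Y1 = (n3 & y1) | (y3 & n1)
    return (Y3 << 2) | (Y2 << 1) | Y1
```

Evaluating `sum_of_products` on every input reproduces the permutation, which is a quick way to check a hand-derived set of expressions.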

In the following, we deduce these switching 
function expressions for several permutations use- 
ful in parallel processing. We reveal some inter- 
esting characteristics of these permutations. 


Perfect Shuffle Permutation 


This permutation has been shown to be useful 
in algorithms which are suitable for parallel pro- 
cessing, those designed for polynomial evaluation, 
sorting, etc. [4]. 

The perfect shuffle is defined as

$$p(i) = \begin{cases} 2i, & 0 \le i \le N/2 - 1 \\ 2i - N + 1, & N/2 \le i \le N - 1 \end{cases}$$


Theorem 1. The perfect shuffle permutation results
in the following functions.

$$Y_k = y_{k-1}, \quad 2 \le k \le n; \qquad Y_1 = y_n$$
Proof.
Proof follows from the fact [2] that

$$p(i) = (2i + \lfloor 2i/N \rfloor) \bmod N$$


Exchange Permutation 


This permutation has been shown to be useful 
in the design of dynamic memory [4] and in mem- 
ory processor interconnection networks for array 
processors [3,4]. 


This exchange permutation is defined as

$$p(i) = \begin{cases} i + 1, & i \text{ even} \\ i - 1, & i \text{ odd} \end{cases}$$

The following is an immediate consequence of 
the definition, and hence there is no need for a 


proof. 


Theorem 2. The exchange permutation results in
the following functions.

$$Y_k = y_k, \quad 2 \le k \le n; \qquad Y_1 = \bar{y}_1$$
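
Both permutations have simple bit-level readings, which the following sketch checks exhaustively for small n. It assumes the usual interpretation: the perfect shuffle rotates the n-bit index left by one position, and the exchange complements the least significant bit. The function names are hypothetical.

```python
def perfect_shuffle(i, n):
    """Perfect shuffle on N = 2**n elements, from the arithmetic definition."""
    size = 1 << n
    return 2 * i if i < size // 2 else 2 * i - size + 1

def rotate_left(i, n):
    """Cyclic left shift of the n-bit index: Y_k = y_{k-1}, Y_1 = y_n."""
    size = 1 << n
    return ((i << 1) & (size - 1)) | (i >> (n - 1))

def exchange(i):
    """Exchange permutation, from the arithmetic definition."""
    return i + 1 if i % 2 == 0 else i - 1

shuffle_of_8 = [perfect_shuffle(i, 3) for i in range(8)]
```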


Uniform Shift 

Uniform shift is an important permutation 
for parallel processing [4,6,7]. It has several 
applications, especially in matrix manipulation. 
The basic permutation performed in Illiac IV consists
of four different types of uniform shift
permutations. All other permutations are achieved 
by using these permutations. 

Uniform shift is defined as 

p(i) = (i + d) mod N

where d represents the amount of shift. 

We need the following additional definitions 
which will be useful for characterizing this and 
other permutations to be presented later. 


Definition. Let $g(x_n, x_{n-1}, \ldots, x_j, \ldots, x_1)$ be an
n-variable function. Then,
$dg/dx_j = g(x_n, x_{n-1}, \ldots, x_j = 0, \ldots, x_1) \oplus g(x_n, x_{n-1}, \ldots, x_j = 1, \ldots, x_1)$
where $\oplus$ is the exclusive-or operator.
The function $dg/dx_j$ is said to be
the Boolean difference of g with respect
to $x_j$.


Definition. Let $g(i) = g(x_n = i_n, x_{n-1} = i_{n-1}, \ldots, x_1 = i_1)$
be the value of the function for the
input $(i_n, i_{n-1}, \ldots, i_1)$, representing the
binary number i.


Definition. The function $g(x_n, x_{n-1}, \ldots, x_1)$ is
said to be a runlength-q function if, for
some i, $g(i) = g(i+1) = \ldots = g(i+q-1) = 1$
and $g(i) = 0$ for all other i.


Example 2. 


FOF OF OF Oo 
C00 OF KF FP oO Oo 


0 6) 
0 6) 
0 1 
0 1 
1 6) 
L 6) 
d, 1 
uf ue 


The above function g(X3,X 9/1) = Xp Xp +X, X2xX3 


is a runlength-3 function. For this function, we 


can compute 


It may be noted that, given the ex-or sum of products
expression for g, the Boolean difference
$dg/dx_j$ can be obtained by simply deleting all
terms that do not involve $x_j$ and, also, all
appearances of $x_j$.
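
The two definitions above translate directly into operations on truth tables. The sketch below assumes a function is stored as a list `g` of 0/1 values indexed by the binary number i, with $x_1$ as the least significant bit. The sample function is a hypothetical runlength-3 function chosen for illustration and is not claimed to be the one tabulated in Example 2.

```python
def boolean_difference(g, n, j):
    """Truth table of dg/dx_j: g with x_j = 0, XOR g with x_j = 1."""
    mask = 1 << (j - 1)
    return [g[i & ~mask] ^ g[i | mask] for i in range(2 ** n)]

def runlength(g):
    """Return q if g is a runlength-q function, otherwise None."""
    ones = [i for i, value in enumerate(g) if value]
    if ones and ones == list(range(ones[0], ones[0] + len(ones))):
        return len(ones)
    return None

# Hypothetical sample: g(i) = 1 exactly for i = 2, 3, 4.
g = [0, 0, 1, 1, 1, 0, 0, 0]
dg_dx3 = boolean_difference(g, 3, 3)
```

The resulting table does not depend on $x_3$, as expected of a Boolean difference with respect to $x_3$.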

Theorem 3. For the uniform shift function, $Y_n$ can
be expressed as

$$Y_n = y_n \oplus g_n(y_{n-1}, y_{n-2}, \ldots, y_1)$$

The other functions $Y_{n-1}, Y_{n-2}, \ldots, Y_1$ are related
to $Y_n$ through the following recursive rule.

If $Y_k = y_k \oplus g_k(y_{k-1}, y_{k-2}, \ldots, y_1)$,

then $Y_{k-1} = y_{k-1} \oplus \dfrac{dg_k}{dy_{k-1}}$.

Proof.

Let $(d_n, d_{n-1}, \ldots, d_1)$ be the binary representation
of the number d, the amount of shift.
The permutation p(i) = (i+d) mod N corresponds
to addition of the fixed number d to each i.
Since the addition is performed modulo $N = 2^n$,
any carry from the nth position is discarded.

The variables $Y_n, Y_{n-1}, \ldots, Y_1$ represent the
sum bits produced by adding d to the number represented
by the variables $y_n, y_{n-1}, \ldots, y_1$.
Thus, the function $Y_n$ can be expressed as
$Y_n = y_n \oplus d_n \oplus C_n(y_{n-1}, y_{n-2}, \ldots, y_1)$, where $C_n$
is the function representing the carry bit into
the nth position. However, it may be noted that,
since d is a fixed integer for all i, $C_n$ is
expressed as a function of only the variables
$y_{n-1}, y_{n-2}, \ldots, y_1$.

Let $g_n(y_{n-1}, y_{n-2}, \ldots, y_1) = d_n \oplus C_n(y_{n-1}, y_{n-2}, \ldots, y_1)$.
Thus, $Y_n = y_n \oplus g_n(y_{n-1}, y_{n-2}, \ldots, y_1)$.

Similarly, for any k, we can express
$Y_k = y_k \oplus g_k(y_{k-1}, \ldots, y_1)$ where

$$g_k(y_{k-1}, y_{k-2}, \ldots, y_1) = d_k \oplus C_k(y_{k-1}, y_{k-2}, \ldots, y_1) \qquad (1)$$

The function $C_k(y_{k-1}, y_{k-2}, \ldots, y_1)$ represents the
carry into the kth position.
Now, one can express

$$C_k(y_{k-1}, y_{k-2}, \ldots, y_1) = d_{k-1} y_{k-1} + d_{k-1} C_{k-1}(y_{k-2}, y_{k-3}, \ldots, y_1) + y_{k-1} C_{k-1}(y_{k-2}, y_{k-3}, \ldots, y_1) \qquad (2)$$

$$= d_{k-1} y_{k-1} \oplus d_{k-1} C_{k-1}(y_{k-2}, y_{k-3}, \ldots, y_1) \oplus y_{k-1} C_{k-1}(y_{k-2}, y_{k-3}, \ldots, y_1) \qquad (3)$$

where $C_{k-1}(y_{k-2}, y_{k-3}, \ldots, y_1)$ is the carry function
into the (k-1)st position. Substituting (3)
in (1), one has

$$g_k(y_{k-1}, y_{k-2}, \ldots, y_1) = d_k \oplus d_{k-1} y_{k-1} \oplus d_{k-1} C_{k-1}(y_{k-2}, y_{k-3}, \ldots, y_1) \oplus y_{k-1} C_{k-1}(y_{k-2}, y_{k-3}, \ldots, y_1) \qquad (4)$$

Now,

$$g_k(y_{k-1} = 0, y_{k-2}, \ldots, y_1) = d_k \oplus d_{k-1} C_{k-1}(y_{k-2}, y_{k-3}, \ldots, y_1) \qquad (5)$$

and

$$g_k(y_{k-1} = 1, y_{k-2}, \ldots, y_1) = d_k \oplus d_{k-1} \oplus d_{k-1} C_{k-1}(y_{k-2}, y_{k-3}, \ldots, y_1) \oplus C_{k-1}(y_{k-2}, y_{k-3}, \ldots, y_1) \qquad (6)$$

From (5) and (6), one has

$$\frac{dg_k}{dy_{k-1}} = d_{k-1} \oplus C_{k-1}(y_{k-2}, y_{k-3}, \ldots, y_1) \qquad (7)$$

But,

$$Y_{k-1} = y_{k-1} \oplus d_{k-1} \oplus C_{k-1}(y_{k-2}, y_{k-3}, \ldots, y_1) \qquad (8)$$

Substituting (7) in (8), we get

$$Y_{k-1} = y_{k-1} \oplus \frac{dg_k}{dy_{k-1}}$$

Q.E.D.
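
The recursive rule for the uniform shift can be checked by brute force. The sketch below assumes $y_1$ is the least significant bit and reads the truth table of $g_k$ directly from the permutation with $y_k = 0$; the function names are hypothetical.

```python
def g_table(perm, k):
    """Truth table of g_k over (y_{k-1}, ..., y_1), read off with y_k = 0."""
    return [(perm[r] >> (k - 1)) & 1 for r in range(1 << (k - 1))]

def check_uniform_shift_rule(n, d):
    """Check Y_k = y_k XOR g_k and g_{k-1} = dg_k/dy_{k-1} for p(i) = (i+d) mod 2**n."""
    size = 1 << n
    perm = [(i + d) % size for i in range(size)]
    for k in range(n, 0, -1):
        gk = g_table(perm, k)
        low = (1 << (k - 1)) - 1
        for i in range(size):
            y_k = (i >> (k - 1)) & 1
            if ((perm[i] >> (k - 1)) & 1) != (y_k ^ gk[i & low]):
                return False
        if k >= 2:
            half = 1 << (k - 2)
            # Boolean difference of g_k with respect to y_{k-1}.
            diff = [gk[s] ^ gk[s | half] for s in range(half)]
            if diff != g_table(perm, k - 1):
                return False
    return True

# g_3 for a uniform shift of 3 on N = 8, as a table over (y_2, y_1).
g3_shift3 = g_table([(i + 3) % 8 for i in range(8)], 3)
```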


Corollary 1. The function $g_n(y_{n-1}, y_{n-2}, \ldots, y_1)$
is a runlength-d function.

Proof.

Consider the following table which illustrates
the function $Y_n$ for a uniform shift of d.

[Table: truth table of $Y_n$ as a function of $(y_n, y_{n-1}, \ldots, y_1)$ for a uniform shift of d.]

It is readily seen that $Y_n \oplus y_n = g_n(y_{n-1}, y_{n-2}, \ldots, y_1)$
is independent of $y_n$ and is a
runlength-d function.

Q.E.D.

Example 3.

Consider the following permutation which
corresponds to a uniform shift of 3 on N = 8.

  i   p(i)    y3 y2 y1    Y3 Y2 Y1
  0    3       0  0  0     0  1  1
  1    4       0  0  1     1  0  0
  2    5       0  1  0     1  0  1
  3    6       0  1  1     1  1  0
  4    7       1  0  0     1  1  1
  5    0       1  0  1     0  0  0
  6    1       1  1  0     0  0  1
  7    2       1  1  1     0  1  0

$$Y_3 = y_3 \oplus y_2 \oplus y_1 \oplus y_1 y_2$$

$$Y_2 = y_2 \oplus y_1 \oplus 1$$

$$Y_1 = y_1 \oplus 1$$

$$g_3(y_3, y_2, y_1) = y_2 \oplus y_1 \oplus y_1 y_2$$

$$g_2(y_3, y_2, y_1) = y_1 \oplus 1$$

$$g_1(y_3, y_2, y_1) = 1$$

It may be seen that

$$g_2(y_3, y_2, y_1) = \frac{dg_3}{dy_2}, \qquad g_1(y_3, y_2, y_1) = \frac{dg_2}{dy_1}$$


Unscrambling of t-Ordered Vector


This permutation has been found useful in
aligning data in memory modules to obtain
conflict-free access of related items in an array
[3-5].

This permutation is defined as $p(i) = t \cdot i \bmod N$
where t is relatively prime to N.

First, it may be observed that, since $N = 2^n$,
t must be an odd number and hence must have a 1
in the least significant position. Let $t_n, t_{n-1}, \ldots, t_2, t_1$
be the binary representation of t. In
this, $t_1 = 1$.

Theorem 4. For the permutations that realize the
unscrambling of t-ordered vectors, $Y_n$ can
be expressed as $Y_n = y_n \oplus g_n(y_{n-1}, y_{n-2}, \ldots, y_1)$.
The functions $Y_{n-1}, Y_{n-2}, \ldots, Y_1$ are
related to $Y_n$ through the following recursive
rule.

If $Y_k = y_k \oplus g_k(y_{k-1}, y_{k-2}, \ldots, y_1)$,

then $Y_{k-1} = y_{k-1} \oplus \dfrac{dg_k^*}{dy_{k-1}}$, where

$$\frac{dg_k^*}{dy_{k-1}} = \begin{cases} \dfrac{dg_k}{dy_{k-1}}, & \text{if } (t-1)/2 \text{ is even} \\[2ex] \overline{\dfrac{dg_k}{dy_{k-1}}}, & \text{if } (t-1)/2 \text{ is odd} \end{cases}$$

Proof.

Consider Table 1 which illustrates the various
terms in the multiplication of $(t_n, t_{n-1}, \ldots, t_1)$
with the variables $(y_n, y_{n-1}, \ldots, y_1)$.

Since the multiplication is performed modulo
$N = 2^n$, we can discard all the partial product
bits to the left of the nth bit position, as
shown by the dotted line.

First, it may be observed that, in each column,
there may be one or more carry bits in addition
to the partial product bits. These carry
bits are generated during the summation of the
columns to the right of this column, which are
then propagated to this position. Let $C_k$, $2 \le k \le n$,
represent the ex-or sum of these carry
bits in each position. Note that $C_1 = C_2 = 0$.
Since t is a fixed number, $C_k$ is a function
of $y_{k-1}, y_{k-2}, \ldots, y_1$ only. Thus,

$$Y_n = t_1 y_n \oplus t_2 y_{n-1} \oplus \cdots \oplus t_n y_1 \oplus C_n(y_{n-1}, y_{n-2}, \ldots, y_1)$$

Let

$$g_n(y_{n-1}, y_{n-2}, \ldots, y_1) = t_2 y_{n-1} \oplus \cdots \oplus t_n y_1 \oplus C_n(y_{n-1}, y_{n-2}, \ldots, y_1)$$

Since $t_1 = 1$, one has

$$Y_n = y_n \oplus g_n(y_{n-1}, y_{n-2}, \ldots, y_1)$$
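Table 1


MULTIPLICATION TABLE


$$\begin{array}{ccccccccc}
y_n & y_{n-1} & \cdots & y_k & y_{k-1} & \cdots & y_3 & y_2 & y_1 \\
t_n & t_{n-1} & \cdots & t_k & t_{k-1} & \cdots & t_3 & t_2 & t_1 \\
\hline
t_1 y_n & t_1 y_{n-1} & \cdots & t_1 y_k & t_1 y_{k-1} & \cdots & t_1 y_3 & t_1 y_2 & t_1 y_1 \\
t_2 y_{n-1} & t_2 y_{n-2} & \cdots & t_2 y_{k-1} & t_2 y_{k-2} & \cdots & t_2 y_2 & t_2 y_1 & \\
t_3 y_{n-2} & t_3 y_{n-3} & \cdots & t_3 y_{k-2} & t_3 y_{k-3} & \cdots & t_3 y_1 & & \\
\vdots & \vdots & & \vdots & \vdots & & & & \\
t_{n-1} y_2 & t_{n-1} y_1 & & & & & & & \\
t_n y_1 & & & & & & & &
\end{array}$$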

Similarly, $Y_k$ can be expressed as

$$Y_k = y_k \oplus g_k(y_{k-1}, y_{k-2}, \ldots, y_1)$$

where

$$g_k(y_{k-1}, y_{k-2}, \ldots, y_1) = t_2 y_{k-1} \oplus t_3 y_{k-2} \oplus \cdots \oplus t_k y_1 \oplus C_k(y_{k-1}, y_{k-2}, \ldots, y_1)$$

Now,

$$g_k(y_{k-1} = 0, y_{k-2}, \ldots, y_1) = t_3 y_{k-2} \oplus \cdots \oplus t_k y_1 \oplus C_k(y_{k-1} = 0, y_{k-2}, \ldots, y_1)$$

and

$$g_k(y_{k-1} = 1, y_{k-2}, \ldots, y_1) = t_2 \oplus t_3 y_{k-2} \oplus \cdots \oplus t_k y_1 \oplus C_k(y_{k-1} = 1, y_{k-2}, \ldots, y_1)$$

Thus,

$$\frac{dg_k}{dy_{k-1}} = t_2 \oplus C_k(y_{k-1} = 0, y_{k-2}, \ldots, y_1) \oplus C_k(y_{k-1} = 1, y_{k-2}, \ldots, y_1) = t_2 \oplus \frac{dC_k}{dy_{k-1}} \qquad (9)$$

We can express $C_k(y_{k-1}, y_{k-2}, \ldots, y_1)$, which is
the ex-or sum of all the carry bits into the kth
position, as

$$C_k(y_{k-1}, y_{k-2}, \ldots, y_1) = C_k^1(y_{k-1}, y_{k-2}, \ldots, y_1) \oplus C_k^2(y_{k-2}, y_{k-3}, \ldots, y_1);$$

in this, $C_k^1(y_{k-1}, y_{k-2}, \ldots, y_1)$ represents the
function for the carry bit produced in the addition
of the (k-1)st column only, whereas
$C_k^2(y_{k-2}, y_{k-3}, \ldots, y_1)$ represents the function for
the sum of the carry bits produced in the addition
of the jth columns, $2 \le j \le (k-2)$, and which are
propagated into the kth column. It may be noted
that $y_{k-1}$ does not appear anywhere in the 1st
through (k-2)nd column. The first time $y_{k-1}$
appears is in the (k-1)st column. Therefore, the
function $C_k^2$ is independent of $y_{k-1}$. Thus,

$$\frac{dC_k}{dy_{k-1}} = \frac{dC_k^1}{dy_{k-1}} \qquad (10)$$

Consider the function $C_k^1(y_{k-1}, y_{k-2}, \ldots, y_1)$.
This, by definition, is the carry bit into the kth
column due to the addition of the terms $y_{k-1}$,
$t_2 y_{k-2}, t_3 y_{k-3}, \ldots, t_{k-1} y_1, C_{k-1}$ which appear in
the (k-1)st column. This function is represented in
Table 2. The function $C_k^1 = 1$ represents all the
rows which have 2i 1's for i = odd integers.
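Table 2


THE TRUTH TABLE FOR $C_k^1$


[Table: truth table of $C_k^1$ over the columns $y_{k-1}$, $t_2 y_{k-2}$, $t_3 y_{k-3}$, ..., $t_{k-1} y_1$, $C_{k-1}$, with an upper half $B_0$ ($y_{k-1} = 0$) and a lower half $B_1$ ($y_{k-1} = 1$).]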


This follows from the observation that, if the
number of 1's in the (k-1)st column is equal to
2, 6, 10, 14, etc., then we get a carry into the
kth column.

Now, consider the function $dC_k^1/dy_{k-1}$.

The truth table for this function can be computed
in the following way. First, note that $dC_k^1/dy_{k-1}$
is independent of $y_{k-1}$, and hence the truth table
will have half as many rows as the above one for
$C_k^1$. The value of the function $dC_k^1/dy_{k-1}$ for the
ith row can be computed as follows: Take the
value of the function $C_k^1$ for the ith row in the
upper half $B_0$, and then compute the ex-or sum
of this with the value of the function for the
ith row in the lower half $B_1$ as shown in Table
2. The ith rows in $B_0$ and $B_1$ are identical
in the $t_2 y_{k-2}, t_3 y_{k-3}, \ldots, C_{k-1}$ positions.
Further, it may be seen that, if the ith rows
have an even number of 1's in these positions,
then the two values of $C_k^1$ are identical. On
the other hand, if they have an odd number of
1's, then $C_k^1$ has complementary values for these
two rows. This, therefore, implies that $dC_k^1/dy_{k-1}$
is equal to 1 for the ith row only if it
has an odd number of 1's in the $t_2 y_{k-2}, t_3 y_{k-3}, \ldots, C_{k-1}$
positions and is equal to 0 otherwise.
From this, it can now be deduced that

$$\frac{dC_k^1}{dy_{k-1}} = t_2 y_{k-2} \oplus t_3 y_{k-3} \oplus \cdots \oplus t_{k-1} y_1 \oplus C_{k-1}$$

Substituting this first in (10) and then in (9),
one has

$$\frac{dg_k}{dy_{k-1}} = t_2 \oplus t_2 y_{k-2} \oplus t_3 y_{k-3} \oplus \cdots \oplus t_{k-1} y_1 \oplus C_{k-1} \qquad (11)$$

But

$$Y_{k-1} = y_{k-1} \oplus t_2 y_{k-2} \oplus t_3 y_{k-3} \oplus \cdots \oplus t_{k-1} y_1 \oplus C_{k-1} \qquad (12)$$

From (11) and (12)

$$Y_{k-1} = y_{k-1} \oplus t_2 \oplus \frac{dg_k}{dy_{k-1}}$$

Case I: (t-1)/2 = even number

In this case, $t_2 = 0$

$$Y_{k-1} = y_{k-1} \oplus \frac{dg_k}{dy_{k-1}}$$

Case II: (t-1)/2 = odd number

In this case, $t_2 = 1$

$$Y_{k-1} = y_{k-1} \oplus \frac{dg_k}{dy_{k-1}} \oplus 1$$

Q.E.D.
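
The two cases of the recursive rule can be checked by brute force over all odd multipliers t. The sketch below assumes $y_1$ is the least significant bit and uses the fact that $t_2$ is 1 exactly when (t-1)/2 is odd; the function name is hypothetical.

```python
def check_unscrambling_rule(n, t):
    """Check g_{k-1} = dg_k/dy_{k-1} XOR t_2 for p(i) = t*i mod 2**n, t odd."""
    size = 1 << n
    perm = [(t * i) % size for i in range(size)]
    t2 = (t >> 1) & 1  # equals 1 exactly when (t - 1) / 2 is odd
    for k in range(n, 1, -1):
        # Truth tables of g_k and g_{k-1}, read off with the leading variable at 0.
        gk = [(perm[r] >> (k - 1)) & 1 for r in range(1 << (k - 1))]
        gk1 = [(perm[r] >> (k - 2)) & 1 for r in range(1 << (k - 2))]
        half = 1 << (k - 2)
        for s in range(half):
            diff = gk[s] ^ gk[s | half]  # Boolean difference dg_k/dy_{k-1}
            if gk1[s] != (diff ^ t2):
                return False
    return True

odd_multipliers_ok = all(
    check_unscrambling_rule(n, t)
    for n in range(2, 8)
    for t in range(1, 1 << n, 2)
)
```

When $t_2 = 0$ the Boolean difference is used as is (Case I); when $t_2 = 1$ it is complemented (Case II).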


The other permutations [4] such as the 
interchange of elements 2°" apart and 


p(i) = (t-i) mod N have characterizations similar
to those presented in Theorems 3 and 4.

In the next section, some of the above re- 
sults will be used in formulating a procedure for 
characterization of memory processor interconnection
networks.


On Memory-Processor Interconnection Networks 


In this section, a procedure for characterizing
permutation networks is developed. This procedure
is based on the following observations:
(1) that any permutation which is realized by a
network, after n passes through the network, can
be expressed as a composition (sequence) of n
permutations; (2) that each of these n permutations,
in turn, can be expressed as a composition
of certain elementary permutations. (These elementary
permutations were described in the last
section.) Using the earlier derived results (that
characterized these elementary permutations), one
can characterize the permutations admitted by the
network.

In the following, this technique is first
developed for shuffle-exchange networks; then, we
illustrate its applicability to any arbitrary network.
The shuffle-exchange networks considered
here are the one proposed by Lawrie [3] and its
simplification proposed by Lang and Stone [4].

First, a brief description of these two types 
of networks is given below. (From this point on, 
the shuffle-exchange network presented by Lawrie 
[3] will be referred to as SE, and the simplified 
version by Lang-Stone [4] will be referred to as 
an SSE network.) Figure 1 presents an abstraction 
of these two networks. 

These networks are N-input, N-output networks,
and they can perform many useful permutations
on N data items, where $N = 2^n$ for some n;
the data items are circulated through the network
n times (passes) to achieve the desired
permutation.


These networks are composed of two subnetworks,
which are denoted as S and E in Fig. 1. The
first subnetwork, S, moves the contents of its
input, i, to its output, j, where j = p(i),
and p is the perfect shuffle map. The second
subnetwork, E, moves the contents of its inputs,
i and (i+1), to its outputs, (i+1) and i,
respectively, for certain selected i's, where
i is an even number. For all other inputs, i,
the contents are moved straight through to the
outputs, i. Thus, in effect, E performs exchanges
on the contents of certain selected pairs
of adjacent inputs.
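
The two subnetworks can be sketched in a few lines of Python. This is only a minimal illustration, not the authors' implementation: the function names are hypothetical, the perfect shuffle is assumed to be the usual left rotation of the n-bit address, and, for brevity, one control bit per adjacent pair is assumed (the text attaches a bit to each data item).

```python
def perfect_shuffle(i, n):
    """Perfect shuffle map p(i): rotate the n-bit address i left by one."""
    mask = (1 << n) - 1
    return ((i << 1) | (i >> (n - 1))) & mask


def shuffle_stage(items, n):
    """Subnetwork S: the content of input i moves to output p(i)."""
    out = [None] * len(items)
    for i, item in enumerate(items):
        out[perfect_shuffle(i, n)] = item
    return out


def exchange_stage(items, pair_bits):
    """Subnetwork E: swap inputs i and i+1 (i even) when the pair's bit is 1."""
    out = list(items)
    for i in range(0, len(out), 2):
        if pair_bits[i // 2]:
            out[i], out[i + 1] = out[i + 1], out[i]
    return out


def one_pass(items, n, pair_bits):
    """One pass through the network: S followed by E."""
    return exchange_stage(shuffle_stage(items, n), pair_bits)
```

Calling `one_pass` n times, with a (possibly different) list of pair bits each time, models the n passes described above for N = 2^n items.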

A set of control bits determines the subset 
of pairs which are selected for exchange. One bit 
per pass per data item is all that is needed for 
controlling the exchange operation. The data 
items carry with them these control bits, and, 
thus, the control bits form an integral part of 
the contents of the inputs or outputs.

The number of control bits that are required
for operation in the SSE network is substantially
smaller than that needed in the SE network.

In the case of SE networks, each data item
carries with it n control bits. During the kth
pass, the kth control bits are used for controlling
the exchange operation. The contents of a
certain pair of inputs are exchanged (not exchanged)
during the kth pass if the kth control bits
(which are contained in both of these inputs) are
1 (0).

On the other hand, the SSE network receives 
only one control bit per data item. These bits 
constitute the control bits that are used during 
the first pass. These control bits are used in 
the same way as in the SE network where they con- 
trol the exchange operation during the first 
pass. However, for every successive pass after 
this first pass, a new control bit is computed 
for each data item, and these new bits now con- 
trol the exchange operation during that pass. 
These new bits are computed from the control bits 
that have been used during the immediately pre- 
ceding pass; this is described below: 

During any pass other than the first, certain
predefined Boolean operations are performed
on every pair of control bits (which are contained
in the ith and (i+1)th inputs of E for all
even i). The bits produced by these Boolean
operations then replace the existing control bits
(in their respective pairs of inputs), and these
new bits are then used by E as control bits to
perform the exchange operation for that current
pass.
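
The control-bit update just described can be sketched as follows. This is an illustrative reading, not the authors' circuit: the helper name `update_control_bits` is hypothetical, and it assumes that both members of a pair receive the same new bit, obtained by applying one two-input Boolean operation to the pair's current bits.

```python
def update_control_bits(bits, op):
    """Replace each adjacent pair (i, i+1), i even, by op applied to the pair.

    bits -- list of 0/1 control bits, one per input of E (even length)
    op   -- any two-input Boolean operation returning 0 or 1
    """
    out = list(bits)
    for i in range(0, len(bits), 2):
        new_bit = op(bits[i], bits[i + 1])
        out[i] = out[i + 1] = new_bit
    return out


def ex_or(a, b):
    """Exclusive-or of two bits."""
    return a ^ b


def equivalence(a, b):
    """Equivalence (complement of exclusive-or) of two bits."""
    return 1 - (a ^ b)
```

For example, `update_control_bits([0, 1, 1, 1], ex_or)` yields `[1, 1, 0, 0]`: the first pair will be exchanged in the current pass and the second will not.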

Let $\{(i, p(i)) \mid 0 \le i \le N-1\}$ be any permutation
that is admitted (realized) by an SE or
SSE network in n passes through the network.
Let this permutation, p, be represented by n
switching functions, $Y_n, Y_{n-1}, \ldots, Y_1$, of the
variables $y_n, y_{n-1}, \ldots, y_1$. In the following, we
derive several results that characterize these
$Y_j$ functions; these will provide a new perspective
on the permutations admitted by SE and SSE
networks.

First, we introduce additional notations 
which will be useful in deriving the results. 

Let $\{(i, p_k(i)) \mid 0 \le i \le N-1\}$ be any permutation
which is admitted by the network in k
passes. Let the permutation, $p_k$, be represented
by n switching functions, $Y_n^k, Y_{n-1}^k, \ldots, Y_1^k$,
of the variables $y_n, y_{n-1}, \ldots, y_1$.

Next, let $\{(i, p_{ks}(i)) \mid 0 \le i \le N-1\}$ be
that intermediate permutation which is realized
at the output of the shuffle network, during the
kth pass, which results, at the completion of the
kth pass, in the permutation, $p_k$, at the network
output. Let this permutation, $p_{ks}$, be
represented by n switching functions, $X_n^k, X_{n-1}^k,
\ldots, X_1^k$, of the variables $y_n, y_{n-1}, \ldots, y_1$.

Finally, let $C_i^k$ represent that control
bit used in E for the contents of its ith input
during the kth pass.

It follows from the above notations that, for
all j, $1 \le j \le n$:

$$Y_j^k = \begin{cases} y_j & \text{for } k = 0, \text{ and} \\ Y_j & \text{for } k = n \end{cases} \qquad (13)$$


Lemma 1.

$$X_j^k = \begin{cases} Y_{j-1}^{k-1} & \text{for } 2 \le j \le n, \text{ and} \\ Y_n^{k-1} & \text{for } j = 1 \end{cases}$$


Proof.

Proof is a direct consequence of Theorem 1
in conjunction with the fact that the outputs of
the (k-1)th pass are fed back to form the inputs
to the kth pass.

Q.E.D.


Theorem 5. The Y¥, functions that represent any 
permutation which is admitted by the SE net- 
work can be expressed as 


Y= My OE Yipee Mr Ye Yano! 


eee rYy) for all k,l<k<m 
where the function, 


control bits ck 
qth pass. 


f,, is defined by the 
that are used during the 


Proof. 


For the sake of simplicity, this proof will 
be developed in two parts: First, we will show 
that 
: ls AO ees A ey 


at 


Then, we prove, in general, that 


Y. = yy. So) re 2 


Vpn Re?! eae rY,) 
Using both Lemma 1 (for k = 1) and the relationship
(13), one has

$$X_j^1 = y_{j-1} \ \text{ for } 2 \le j \le n, \qquad X_1^1 = y_n \qquad (14)$$

The following table describes the relationship
between $X_n^1, X_{n-1}^1, \ldots, X_1^1$ and $Y_n^1, Y_{n-1}^1, \ldots, Y_1^1$.
Table 3 is derived by using the following
observations regarding the mapping of input
addresses to output addresses, as performed by E:

(a) The output pair is identical to the
input pair when the input pair is
not exchanged.

(b) On the other hand, when an input
pair is exchanged, the resulting
output pair has the following
characteristic:

The binary numbers that represent
the output pair are identical to the
binary numbers that represent the input
pair in all the positions except the
least significant position. The bit
in the least significant position of
the output pair is exactly the complement
of the bit in the least significant
position of the input pair.

(c) The least significant bits of any
input pair, i and (i+1), are
0 and 1, respectively, because
i is even.

(d) The input pair, i and (i+1),
is exchanged if $C_i^1 = C_{i+1}^1 = 1$,
and the pair is not exchanged if
$C_i^1 = C_{i+1}^1 = 0$.


Table 3 


Y AND X VARIABLES DURING THE FIRST PASS 




As a direct consequence of the above observations,
it is evident that all the $Y_j^1$'s are
identical to the corresponding $X_j^1$'s, except for j = 1.
The column, $Y_1^1$, can be represented as $C_i^1$ and
$\bar{C}_{i+1}^1$ in the ith and (i+1)th rows, respectively,
for all even i.

From this, it can be derived that: 


(iy Ven. Se ny, 
J J 
ak 
(2) v; is a function of xX See 1’ 
ewe e 
1 
The function, vy" can be expressed as shown 
below: 
l=-1-1 - -1 1 
Cok ne re X 1@ x, nett Xok] @ +e 
@® ae 17% OC oaks este C5) 
n nn- n ni 1 
2 -2 
: 1 1 ; 
Since C. = C. for all even 1. 
1 itl 


Consider the following well-known identities in
Boolean algebra:

(a) $P \oplus Q = P + Q$, if $PQ = 0$ (16)
(b) $\bar{P} = 1 \oplus P$ (17)
(c) $P \oplus \bar{P} = 1$ (18)
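
The three identities are small enough to confirm by exhaustive truth-table check. The sketch below is only a sanity check under the standard reading of the symbols (exclusive-or, inclusive-or, complement); the function name `identities_hold` is hypothetical.

```python
from itertools import product


def identities_hold():
    """Check identities (16)-(18) for every assignment of the bits P and Q."""
    for p, q in product((0, 1), repeat=2):
        # (16): ex-or equals inclusive-or whenever the product PQ is 0.
        if (p & q) == 0 and (p ^ q) != (p | q):
            return False
        # (17): the complement of P equals 1 ex-or P.
        if (1 - p) != (1 ^ p):
            return False
        # (18): P ex-or its complement is always 1.
        if (p ^ (1 - p)) != 1:
            return False
    return True
```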


Using these identities, one can express (15) as: 
1-1 - - - l1- - ~ 
YD oY nn1¥nn-2? 2? %2¥. © oY na? *¥2¥1 © +s 


1 - 1 ~ 
@c n Yn1¥p2t**Y2¥1 OC n ~n-1%n-2°°° 2271 
4 2 -2 
using (13) and Lemma l. 
Let f(y _yrYpuoreeer¥7) 
= Mg aFnagt FoF, © ON aFgeaFgeat Fah @ 
on=-1%n-2°° *%271 ne n-1n-2°* °Y2%1 0 °° 


1 - 1 
© <yn_ ne n-2° 0°VoVy © © yn_)Yn-1¥n-2" *YoVy 


Y= OF Wy Yaeger ey) = 


Using Lemma 1 and the above observation re- 
garding the least significant Y-bit, one can 
state that, in general: 


k k-1 
a Y. = y. Oe es ae Se 0 and 
( ) 5 Y421 J 4 
(b) vt is a function of the variables, 


k=-1 Jk- Kad 
bees e Peeeny | . 


implies the following, in general, 
for any k: 


This, in turn, 


n-k+1 n 


yy = Ya = yy | ty ok (20) 
. ~ Ynek+5 “Yoksy 7 215% oa 
a = oe = (k+l) < 5 <n (22) 
Substituting k=n in (20), one has: 
7%, 


Thus, Eq. (19) now becomes: 


a = y,© Ea Ynnv'Yn-2? eee rY,) 


Now, to prove the theorem, in general, for 
any Yx-, consider the (n-k+1) th pass through the 
network. For the sake of convenience, let n-k+l1 
be denoted as m. 

One can derive the following equation for 
yMayn-k+l | by using techniques similar to those 
used for deriving Yj. 


m_ ym moma smsm m=m= = 
Y= HO ORR 3h @ Coe Rae Bg %y tee 
m m m 
@ a eee a> (23) 
Using Lemma 1 in conjunction with Eqs. (20)
through (22), one can deduce the following: 

m n-k+1 

x = x = (24) 
m n-k+1 

. oo ; ; 24S 

ae : a <j<m (25) 

~k+ 

ea aa a m+1<4<n (26) 
J J j-m 


Substituting (24) through (26) in (23), one 
has 


m~ —_ 
X= Oo aye oe Miedo eH 


m- — — ~ ~— 
était arene + fee 
Ct ae pee tin Her Ai eet 


m 
OC n_ Yk k-2 Peng ME, ST 


= + ae ; eed 
Yj EY a og bait Ye ate Y,) 
where 
Se ee eee od ees ee eee a 
= oy ae % oY 
OF k+12k-2 7 °° Yuen n-2 °°° *k+27K41 


m- — — — —_ — — 
e e@e : ees + ee e 
© Co¥, 1% cn Ya net ee 


™m 
OC Ye-k-2 Bee a a nei"? eo ed 


Q.E.D. 


In the following, log N denotes $\log_2 N$.
The following corollaries are a direct consequence
of the above theorem.


Corollary 2. There are exactly $2^{(N/2)\log N}$ distinct
permutations which are admitted by the
SE network.


In the following, we derive an interesting 
consequence of Theorem 5 which characterizes the 
permutations that are admitted by the SE network. 

Let p(i) = j and p(i') = j' for some
permutation p where $0 \le i, j, i', j' \le N-1$. Let

$(i_n, i_{n-1}, \ldots, i_1)$,
$(i'_n, i'_{n-1}, \ldots, i'_1)$,
$(j_n, j_{n-1}, \ldots, j_1)$, and
$(j'_n, j'_{n-1}, \ldots, j'_1)$

be the binary representations of i, i', j, and
j', respectively.


Definition. A permutation p is said to be symmetric
in the kth bit if it satisfies the
following: Given any i, i' for which
$i_j = i'_j$ for all $j \ne k$, and $i_k$ is the
complement of $i'_k$ ($i_k = \bar{i'_k}$), then the kth
bits of the resulting pair, j and j', are
also the complement of each other, i.e.,
$j_k = \bar{j'_k}$.


Definition. A permutation will be said to be
symmetric if it is symmetric in all the n
bits.

Corollary 3. Any permutation which is admitted
by the SE network is symmetric in the most
significant bit.


What is interesting about the above corol- 
lary, also, is that it provides an explanation as 
to why such a permutation as bit reversal is not 
admissible by the SE network. 
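
The definition of symmetry in the kth bit can be tested directly on small permutations. The following sketch is illustrative only: the function names are hypothetical, bits are numbered 1 (least significant) to n (most significant) as in the text, and a permutation is assumed to be given as a list with `perm[i] = p(i)`.

```python
def is_symmetric_in_bit(perm, n, k):
    """True if perm is symmetric in bit k (1 = least, n = most significant)."""
    flip = 1 << (k - 1)
    for i in range(1 << n):
        j = perm[i]
        j_prime = perm[i ^ flip]      # i' differs from i only in bit k
        if (j ^ j_prime) & flip == 0:  # kth bits of j and j' must differ
            return False
    return True


def bit_reversal(n):
    """Bit-reversal permutation on n-bit addresses."""
    return [int(format(i, "0{}b".format(n))[::-1], 2) for i in range(1 << n)]


def uniform_shift(n, t):
    """Uniform shift p(i) = (i + t) mod N with N = 2**n."""
    size = 1 << n
    return [(i + t) % size for i in range(size)]
```

With n = 3, bit reversal fails the test for k = 3: flipping the most significant input bit only changes the least significant output bit. A uniform shift by 1 passes for every k.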

Next, we extend our study to the SSE net- 
works and present several results that reveal the 
structure of the permutations which are admitted 
by these networks. 

In the following, we discuss certain genera- 
lized versions of the SSE networks; this enables 
us to derive results with broader implications. 

Let the control function that is used in any 
pass be a function of certain tag bits which are 
transmitted with the data items. Thus, one can 
select any arbitrary control function for any 
pass. (However, it may be noted that we restrict 
the control function to be the same for all data 
items during a particular pass.) Therefore, so 
as to produce the desired permutation, the use of 
any combination of control functions for the n- 
different passes is available. 

Four tag bits are required in order to specify
1-out-of-16 possible different two-variable
functions. Since there are n passes, only 4n additional
tag bits are required altogether. (For
example, given 256 data items, only 32 tag bits 
are required--a small number when compared to the 
256 control bits that are used.) As it will be 
seen later, the use of these additional bits can 
produce a much larger number of permutations. 
This is significant when compared to the number 
of permutations that are admitted by the network 
when the control functions are prespecified. 
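
The tag-bit arithmetic can be made concrete with a short sketch. The encoding below is an assumption made for illustration (the paper does not specify one): a 4-bit code is read as the truth table of a two-variable function, with the output for inputs (a, b) stored at bit position 2a + b. The names `tag_bits_needed` and `two_variable_function` are hypothetical.

```python
def tag_bits_needed(num_items):
    """4 tag bits per pass, n = log2(N) passes, so 4n tag bits in total."""
    n = num_items.bit_length() - 1
    if num_items <= 0 or num_items != 1 << n:
        raise ValueError("the number of data items must be a power of two")
    return 4 * n


def two_variable_function(code):
    """Decode a 4-bit tag (0..15) into one of the 16 two-variable functions."""
    if not 0 <= code <= 15:
        raise ValueError("a tag is four bits wide")
    return lambda a, b: (code >> ((a << 1) | b)) & 1


EX_OR_CODE = 0b0110        # truth table of exclusive-or under this encoding
EQUIVALENCE_CODE = 0b1001  # truth table of equivalence under this encoding
```

For 256 data items, `tag_bits_needed(256)` gives 32, matching the example in the text.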

This version of the SSE network, which al- 
lows for arbitrary control functions, will be 
hereafter referred to as the "SSEAC" network. 

In the following, we present two lemmas 
which are then used to derive Theorem 6. This 
theorem provides a switching function characteri- 
zation of the permutations which are admitted by 
the above SSEAC networks. Then, certain conse- 
quences of the theorem are examined, and these 
provide further insight into the permutations 
admitted by these networks. 


Lemma 2.

$$(a_0 P_0 \oplus a_1 P_1 \oplus \cdots \oplus a_r P_r) * (b_0 P_0 \oplus b_1 P_1 \oplus \cdots \oplus b_r P_r)
= (a_0 * b_0) P_0 \oplus (a_1 * b_1) P_1 \oplus \cdots \oplus (a_r * b_r) P_r$$

where

(1) the $a_i$'s and $b_i$'s are constants and
are equal to 0 or 1

(2) the $P_i$'s are product terms over some
variables

(3) * is any Boolean operation

(4) $P_i P_j = 0$ for all i, j and $i \ne j$

(5) $P_0 \oplus P_1 \oplus \cdots \oplus P_r = 1$
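
Lemma 2 can be checked by brute force on a small case. The sketch below is an illustration under stated assumptions, not part of the original proof: the product terms are taken to be the four minterms of two variables (which are mutually exclusive and sum to 1, as conditions (4) and (5) require), and each of the 16 two-variable Boolean operations is encoded as a 4-bit truth table. All names are hypothetical.

```python
from itertools import product


def minterms(u, v):
    """The four minterms of (u, v); exactly one of them is 1."""
    return [(1 - u) & (1 - v), (1 - u) & v, u & (1 - v), u & v]


def apply_op(code, a, b):
    """Apply the two-variable Boolean operation whose truth table is code."""
    return (code >> ((a << 1) | b)) & 1


def xor_sum(coeffs, terms):
    """Ex-or sum of coefficient-times-term products."""
    total = 0
    for c, t in zip(coeffs, terms):
        total ^= c & t
    return total


def lemma2_holds():
    """Check the lemma for all operations, coefficients and assignments."""
    for code in range(16):
        for a in product((0, 1), repeat=4):
            for b in product((0, 1), repeat=4):
                combined = [apply_op(code, x, y) for x, y in zip(a, b)]
                for u, v in product((0, 1), repeat=2):
                    terms = minterms(u, v)
                    lhs = apply_op(code, xor_sum(a, terms), xor_sum(b, terms))
                    if lhs != xor_sum(combined, terms):
                        return False
    return True
```

The check succeeds because, for any assignment, exactly one product term equals 1, so each ex-or sum collapses to a single coefficient.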


Lemma 3. The control bits that are used in the
SSEAC network satisfy the following relationship:

$$C_i^{m+1} = C_{i+1}^{m+1} = C_i^m * C_{i+1}^m, \qquad 0 \le i \le 2^n - 1$$

where * is that Boolean operation used during
the (m+1)th pass to produce the new control
bits.


Theorem 6. Any permutation that is admitted by 
the SSEAC network can be represented by 
functions defined as given below: 


K 
HT 


Y Of Wary) and if 


KK 
| 


Yee] = Vie G) fey (Y_5 1Yy,_3° Stese 1Y,) where 


fe (Yyp_5 1Y,_3 | eae rY,) 
SE a OU enres naa) 


* = 
f (Ye liy, 5s 25:0) -2 ry) 


and * is that Boolean operation which is 
used during the (n-k+2)th pass to compute 
the control bits. 


Proof. 


The proof is based on Lemmas 2 and 3 as well 
as some of the observations made in Theorem 5. 


Q.E.D. 


Now, it may be noted that, if the control
function that is used in the network is either
an exclusive-or function or an equivalence function
(as it is proposed in [2]), then the recursive
relationship between $Y_k$ and $Y_{k-1}$ reduces
to the following.


= eee y h 


vy a if the control 
k-1 © dy, function is ex-or 


ae if the control 


df. 
Ye} @) ay O zr function is 
k equivalence 
It is also interesting to note that the above
relationship is precisely the same relationship
derived for those permutations described in Theorems
3 and 4.


Corollary 4. Any permutation which is admitted by 
the SSEAC network is symmetric. 


Proof. 


Proof is an immediate consequence of Theorem
6.
Q.E.D.


Corollary 5. The number of permutations that are
admitted by the SSEAC network is, at most,
equal to


$$\sqrt{2^{\,N + 8 \log N - 8}}$$


Corollary 6. The fraction of SE permutations 
which are admitted by the SSEAC network tends 
to 0, asymptotically, with N. 


Some implications of the above results that 
relate to SSE networks are as follows. 


(1) All of the permutations which are admitted 
by those SSE networks that use ex-or and 
equivalence as control functions have repre- 
sentations that are similar to those charac- 
terizing permutations such as uniform shift, 
unscrambling of t-ordered vectors, etc. 


(2) The use of additional tag bits so as to pro- 
vide for any desired combination of control 
functions can produce a much larger number 
of permutations. 


(3) Any permutation which is admitted by the SSE 
network is symmetric. 


(4) For large values of N, the set of permutations
which are admitted by the SSE network
is a relatively small subset of those permutations
which are admitted by the SE network
(even if all possible combinations of control
functions are provided for the SSE network).


We discuss in the following how to charac- 
terize any arbitrary permutation network. It is 
a well-known fact that any permutation can be ex- 
pressed as a composition of transpositions [9]. 
Since any transposition can be realized by a com- 
bination of uniform shift permutations and ex- 
change permutations, it follows that any arbitrary 
permutation can be expressed as a composition of 
uniform shift permutations and exchange permuta- 
tions. Thus, any permutation that is realized 
after any number of passes through the network 
can be expressed as a composition of uniform 
shift and exchange permutations. Therefore, using 
the techniques presented in this section, it is 
possible to characterize any permutation network, 
given its structure. 
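
The first step of this argument, that any permutation is a composition of transpositions, can be illustrated with a short sketch. The code is a generic decomposition chosen for illustration (it does not further reduce transpositions to uniform shifts and exchanges), and the function names are hypothetical. A permutation is a list with `perm[i] = p(i)`, and a transposition `(i, j)` swaps the entries at positions i and j.

```python
def to_transpositions(perm):
    """Return position swaps that, applied in order, sort perm to the identity."""
    work = list(perm)
    swaps = []
    for i in range(len(work)):
        if work[i] != i:
            j = work.index(i)          # position currently holding value i
            work[i], work[j] = work[j], work[i]
            swaps.append((i, j))
    return swaps


def from_transpositions(swaps, size):
    """Rebuild the permutation by undoing the swaps, starting from the identity."""
    arr = list(range(size))
    for i, j in reversed(swaps):
        arr[i], arr[j] = arr[j], arr[i]
    return arr
```

For instance, `[2, 0, 3, 1]` decomposes into the three swaps `(0, 1)`, `(1, 3)`, `(2, 3)`, and `from_transpositions` recovers the original list from them.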

(Any of the proofs omitted here can be ob- 
tained directly from the authors.) 


Conclusion 


A general framework for the study of permu- 
tations and permutation networks has been intro- 
duced in this paper. Through this formulation, we 
derived certain characterizations for some well- 
known permutations. These results then led to the 
development of a procedure to study the character- 
istics of permutation networks, in general. Also, 
several results that directly relate to shuffle- 
exchange networks were derived as well. 
Summary


Various multistage interconnection networks proposed
in the literature provide fast and efficient
communication between processors and memory modules. These
include Data Manipulator, Flip network, Omega network, 
regular SW Banyan network, indirect Binary n-cube net- 
work and Baseline network. In these networks, there 
exist certain numbers of interconnection permutations 
which cannot be achieved in one pass. In these permu- 
tations, at least one common link is required for es- 
tablishing communication between two or more distinct 
pairs of desired inlets-outlets. Such a link is said 
to be in conflict. The objective of this paper is to 
identify the families of permutations which can be 
passed conflict-free for multistage networks mentioned 
earlier. 


Lenfant [1] has classified certain families of
these permutations or bijections which are most likely
to occur in parallel processing. Three of these families
are conflict-free for Omega network [2]. Even
though the other multistage networks are topologically
equivalent [3], these bijections cannot be performed in
a single pass in any other network. In this paper,
conflict-free families for various multistage interconnection
networks are obtained. The number of such permutations
for each network is seen to be the same, but
the families are mutually disjoint. Hence all these networks
combined together provide much more coverage of
conflict-free permutations. It might be pertinent to
mention that if the frequency of occurrence of certain
types of conflict-free permutations is predetermined,
then this paper does provide a novel technique for selecting
the type of network(s) best suited for that
application.


In a general &x% Omega network if we define a map- 
ping for each switching element (PP Py _» ee P,) for 
i stage as under: 


Yzl(@gPo_y +++ PoP), 15 (y+ -PyPy4y-++Py) for O<i<t (1) 


we obtain Omega network which is equivalent to Baseline
network. The objective here is to identify families of
permutations for Baseline network which correspond to
given families in Omega network. As we are interested
only in the equivalent permutations, we consider the
switching elements only at the input and output terminals.
Thus substituting i=0 and i=$\ell$ in relation (1)
we get:
48202
for i=0 
for i=2 _ 


Though these mappings are defined for switching 
elements, they are also true for the input and output 
links of the Omega network. Now equation (2.a) indi- 
cates that the input terminal is simply a bit reversal 
of its original binary representation. Equation (2.b) 
shows no change in the respective output terminals.
Thus, if there is a permutation 


7 A ees) Oa 


(xq) WG, eel Gyn (3.a) 


1? 


in the Omega network, the corresponding equivalent per- 
mutation in the Baseline network would be 


~ 


x sae en 


x 
0 1 2-1 
(3.b) 
(Xp) (x, ) I(x,n_,) 
where x, and x are given by 
x= (b, bod 2 b,) 
a (b, by sa bo by») 


Using the equivalence of (3.a) and (3.b), con- 
flict-free permutations for Baseline network are de- 
veloped. Similarly general relations of equivalence 
between other interconnection networks are obtained 
which provide the conflict free families for other net- 
works. 


Another problem dealt with in this paper is the 
recognition of conflict-free families for some parti- 
cular network, so that such bijections can be perform- 
ed in one pass. If we are given a one-to-one permuta- 
tion and we want to see whether it can be passed by 
some particular network, we just check whether it be- 
Abstract - Interconnection networks usually
affect the overall cost and performance of parallel
processing systems very significantly.
Designers should work out various ways to enhance
the efficiency of the interconnection networks.
However, there is limited work (or no work at all)
concerning the following areas: how to control
the blocking-type multistage interconnection networks
to accomplish arbitrary permutations in
multiple passes; taking advantage of existing
Benes network control algorithms to realize arbitrary
permutations on blocking-type multistage
interconnection networks; exploiting the relationships
between the admissible permutations and
their control information; reconfiguring a network
to accomplish various functions of different networks.
The reverse-exchange multistage interconnection
network is used to attack these unsolved
problems. We first show a way to reconfigure
a multistage interconnection network to
accomplish various functions of multistage interconnection
networks after comparing the functional
relationships among these networks. The relationships
between the admissible permutations and
their control information of the reverse-exchange
network are then exploited. Finally, we show that
arbitrary permutations can be realized in two
passes (or $2(\log_2 N)$ switching steps, where N is
the network size). Taking advantage of Benes network
control algorithms, we also provide a way to
control the two-pass structure.

I. Introduction 

In many proposed or existing parallel pro- 
cessing architectures, an interconnection network 
is used to realize various permutations of data 
between processors or between processors and 
memory modules [1-11]. The interconnection net- 
work in such architectures significantly, and 
sometimes even dominantly, affects the overall 
cost and performance. However, there are still 
problems in designing a cost-effective
interconnection network. As a continuation of 
previous work [11], this paper first addresses 
some critical problems and then provides relevant 
contributions through the use of a newly con- 
figured multistage interconnection network. 


The possibility of realizing arbitrary per- 
mutations on a multistage interconnection network 
is an important topic to consider. It is noted 
that all the multistage interconnection networks 
so far considered [1-6,11] cannot realize arbi- 
trary permutations in a single pass. Pease [4] 
did point out the necessity of the multiple-pass 
realization on multistage interconnection net- 
works. However, there is no work available on how 
to control the multistage interconnection networks 
to accomplish arbitrary permutations in multiple 
passes. Using single stage networks, some investigators
have presented multiple-pass schemes for
realizing arbitrary permutations. These include
Lang's $O(\sqrt{N})$ shuffle-exchange steps [7],
Orcutt's $O(\sqrt{N} \log N)$ steps of Illiac IV [9],
Stone's $O((\log_2 N)^2)$ perfect-shuffle steps [10],
and Siegel's $O((\log_2 N)^2)$ algorithm [12]. In
reviewing these multiple-pass schemes we feel
that $O(\sqrt{N})$ or $O((\log_2 N)^2)$ is still an impractical
number when N is large. An interconnection network
which can realize arbitrary permutations in
far fewer passes, with efficient control techniques
available, should be favorably considered.


Another important topic to consider is the
design of an efficient network control mechanism
for both the single pass and the multiple passes.
The permutation function and its control information
of an interconnection network are closely
related. However, it is probably fair to say
that there is only a limited understanding of
their relationships. Lawrie [5] proposed a control
technique, named the destination-tag scheme,
which relaxes the requirement of understanding
the relationship. This destination-tag scheme
can route data from the input side to the output
side according to the binary address of the destination.
A homogeneous routing technique which
removes the direction restriction of the destination-tag
scheme has also been presented [11].
These routing techniques have to take care of the
conflict resolution and to decide the route
switching element by switching element. In the
case that the needed permutation function is
known, there is no reason to lose time in calculating
the route switching element by
switching element. Batcher [3] has done some
work on specifying the control information of the
flip network according to the permutation function.
However, his work is restricted to the
flip permutation functions. In contrast, for
the Benes binary network [13], there are well-defined
looping algorithms [14,15] which calculate
the control information according to the permutation
function. It is interesting to note that
with these looping algorithms the Benes binary network
can realize arbitrary permutations while its
hardware facilities are less than twice those of
multistage interconnection networks [1-6,11]. So
far there is no work which takes advantage of
these looping algorithms in realizing
arbitrary permutations on a multistage interconnection
network in multiple passes. However,
Lenfant pointed out [8] that these looping algorithms
are both time-consuming and space-consuming.
In order to meet the time constraints
arising from the use of a Benes binary network as
the alignment network of a parallel computer,
Lenfant [8] presented a Benes network control algorithm
that can calculate the control information
according to the name of five frequently used
permutation families. Lenfant and Tahé [16] also
derived an external control mechanism for the reverse
omega network. Lenfant's approach does represent
an alternative way of searching for an
efficient control mechanism. It is worthwhile to
exploit useful permutation families of an interconnection
network and then develop the control
mechanism accordingly.
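
The destination-tag idea mentioned above can be sketched for an omega-type network. This is a generic textbook-style illustration and not code from [5] or [11]: it assumes n shuffle-plus-switch stages on N = 2^n lines, that stage k examines bit k of the destination (most significant first), and that a switch output is selected by setting the least significant bit of the line address. The function name is hypothetical, and conflicts between simultaneous routes are not modelled.

```python
def destination_tag_route(src, dst, n):
    """Return the line addresses visited when routing src to dst.

    Each stage first applies the perfect shuffle (rotate left) and then
    picks the upper (0) or lower (1) switch output from one destination bit.
    """
    mask = (1 << n) - 1
    cur = src
    path = [cur]
    for k in range(n - 1, -1, -1):
        tag_bit = (dst >> k) & 1
        # Shifting left drops the old top bit; the tag bit replaces the
        # rotated-in least significant bit, which is the switch output chosen.
        cur = ((cur << 1) & mask) | tag_bit
        path.append(cur)
    return path
```

Whatever the source, the last entry of the returned path equals `dst`, because after n stages every original address bit has been replaced by a destination bit.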


The possibility of increasing the combinato- 
rial power of a network through reconfiguring that 
network has not been exploited. The topological 
equivalence of a class of multistage intercon- 
nection networks has been defined previously [11]. 
However, the topological equivalence is not equal 
to the functional equivalence. It would be a 
favored step to define the functional equivalence 
and explicitly express the functional relation- 
ships among the topologically equivalent inter- 
connection networks. These functional relation- 
ships can facilitate obtaining the necessary re- 
configuration information so that one can use 
just one topology and accomplish various functions 
through reconfiguring. 


This paper is organized as follows. In Section
II we review our mathematical model developed
previously [11] and introduce the terminology of
the reverse-exchange network, which will be used to
demonstrate our solutions to the problems described.
Section III provides the functional relationships
among topologically equivalent multistage
interconnection networks. We also address a
way of reconfiguring a network to increase the
network combinatorial power. The capability and
the recursive control algorithm of the reverse-exchange
network are exploited in Sections IV and
V, respectively. In Section VI we prove that the
reverse-exchange network can realize arbitrary
permutations in two passes. Both the construction
and the routing scheme are provided. The network
utilization and limitations are discussed in
Section VII.


II. The Reverse-Exchange Network


A mathematical model was previously presented 
to specify various multistage interconnection net- 
works [11]. For a self-contained purpose, we give 
a brief review here. A multistage interconnection 
network can be defined by specifying its configu- 
ration and its control structure. We shall dis- 
cuss the control structure in the next section. 

By the configuration of an interconnection network 
we mean the topology and the logical names of the 
network components: interconnection links and 
switching elements. The topology is defined in 
terms of three parameters: the number of communi- 
cation paths of each switching element; the number 
of switching element stages; the connectivities of 
the interconnection links between switching ele- 
ments. A set of mathematical rules, called 
topology describing rules, is used to describe the 
connectivities of the interconnection links. Fig.
1 shows a configuration of a 16x16 baseline network.
Note that the logical names of the interconnection
links do not appear in Fig. 1 but they
can be derived from the logical names of the 
switching elements. With such a configuration, 
we can uniquely identify every component and un- 
ambiguously assign the interconnection task by 
specifying the input logical name and the output 
logical name. 


The baseline network can be used as a re- 
ference to evaluate the relationship among various 
multistage interconnection networks. A class of
topologically equivalent multistage interconnec- 
tion networks has been defined [11] by using the 
fact that any network topology in this class can 
be described by the same topology describing rules 
of the baseline network if the logical names are 
properly assigned to the network components. In 
the rest of this paper, to configure a network 
means to assign logical names to network com- 
ponents. Note that a network configured in our 
way may have different functional capabilities 
from those of the network with the same topology 
but configured in the other way. We would like to
introduce some terminology for this class of multistage
interconnection networks before we use it
in solving the described problems. Since the net- 
works in the class, configured in our way, are 
functionally identical although they have diffe- 
rent topologies, we could get the same results 
no matter which topology is used in demonstration. 
In this paper, we use the topology of the baseline 
network.


The permutation function of a network is
accomplished by two components: the interconnection
links and the switching elements. Assume
the binary representation of integer X is
$x_\ell x_{\ell-1} \cdots x_0$, where $\ell = n-1$ and $n = \log_2 N$. The link
connectivity of level i of the network performs
the following permutations:

R, (xX) 1? 2 *Xo) a ie . ee ae gm es : aes (1) 
for 0 <1 <2... For i= 0'and 2+ 1, 
Rie pR oper By) = Kok yar sXe 2 
By Eqs. (1) and (2) we have 
$$R_{\ell+1}(R_\ell(R_{\ell-1}(\cdots R_0(x_\ell x_{\ell-1} \cdots x_0) \cdots))) = x_0 x_1 \cdots x_\ell \qquad (3)$$


Eq. (3) implies that the overall interconnection links of the
network perform the bit-reverse permutation. The exchange is
performed by a switching element on two inputs named by
adjacent numbers. The exchange permutation is defined as

    $E(x_\ell x_{\ell-1} \cdots x_0) = x_\ell x_{\ell-1} \cdots x_1 x_0$          if c = 0
    $E(x_\ell x_{\ell-1} \cdots x_0) = x_\ell x_{\ell-1} \cdots x_1 \bar{x}_0$    if c = 1    (4)


where c is the control bit of the switching element. Since
there are N/2 switching elements in a stage, there is a control
vector associated with each stage. The notations $C_i(j)$ and
$E_i$ are used to denote, respectively, the control bit for the
jth switching element in stage i and the exchange permutation
of stage i associated with control vector $C_i$. Assume X is
permuted to P(X) by the network. Then

    $P(x_\ell x_{\ell-1} \cdots x_0) = R_{\ell+1}(E_\ell(R_\ell(E_{\ell-1}(\cdots (R_1(E_0(R_0(x_\ell x_{\ell-1} \cdots x_0)))) \cdots))))$
                                     $= e_0(x_0)\, e_1(x_1) \cdots e_{\ell-1}(x_{\ell-1})\, e_\ell(x_\ell)$,    (5)

where $e_i(x_i)$, $0 \le i \le \ell$, is equal to $x_i$ or
$\bar{x}_i$ depending on the exchange performed by the
associated switching element in stage i. Fig. 2 shows an
example of a permutation realized on the network of size
$N = 2^3$ with control vectors as specified. Because of the
bit-reverse and the exchange attributes we use the name
reverse-exchange network for this uniquely configured class of
multistage interconnection networks.
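
The composition in Eq. (5) can be traced numerically. The following is a minimal Python sketch, not the authors' implementation: the names `link_permutation` and `route` are hypothetical, and it assumes that switching element j of a stage serves the two links named 2j and 2j+1 (the "adjacent numbers" of the text).

```python
def link_permutation(x, i, n):
    """R_i of Eqs. (1)-(2) applied to the n-bit link name x."""
    if i == 0 or i == n:           # levels 0 and l+1 are the identity, Eq. (2)
        return x
    w = n - i + 1                  # R_i moves x_0 in front of the low w bits
    mask = (1 << w) - 1
    low = x & mask
    low = (low >> 1) | ((low & 1) << (w - 1))
    return (x & ~mask) | low


def route(x, controls, n):
    """Apply R_{l+1} E_l R_l ... R_1 E_0 R_0 of Eq. (5) to the input name x.

    controls[i][j] is the control bit C_i(j) of switching element j in
    stage i (assumed here to serve links 2j and 2j+1).
    """
    for i in range(n):
        x = link_permutation(x, i, n)
        x ^= controls[i][x >> 1]   # Eq. (4): complement x_0 when c = 1
    return link_permutation(x, n, n)


n = 3
N = 1 << n
straight = [[0] * (N // 2) for _ in range(n)]   # every element direct
crossed = [[1] * (N // 2) for _ in range(n)]    # every element crossed
print([route(x, straight, n) for x in range(N)])
print([route(x, crossed, n) for x in range(N)])
```

With every element in the direct state the links alone act, so the first list is the bit-reverse permutation of Eq. (3); with every element crossed, each bit is additionally complemented.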


III. Functional Relationships and Reconfiguration 


In this section, we compare functional relationships among
topologically equivalent multistage interconnection networks
[11] and explore the possibility of reconfiguring a network to
accomplish the functions of various networks. First of all, we
would like to make it clear that the topological equivalence
between two networks implies a one-to-one and onto
relationship between the network components of the two
networks, and does not necessarily imply identical
functionality between the two networks. For example, Fig. 3(a)
shows an omega network reconfigured by using the binary tree
coding method [11]. This network is actually a
reverse-exchange network and is functionally different from
the original omega network proposed by Lawrie [5], as shown in
Fig. 3(b). Previously, we used the topology of the baseline
network [11] to obtain the topological relationships among
various networks. Here, again, we shall use the functions of
the baseline network to obtain functional relationships among
various networks. But before evaluating the functional
relationships, we should define some terminology and
mathematical tools.


The functions of a network are also confined by its control
structure. Siegel [17] proposed that the control structure of
a network sets the states of the switching elements. Thus the
control structure of a network sends the control information
to the switching elements to realize the interconnection
function. We therefore propose here that two multistage
interconnection networks are functionally equivalent (or
identical) if they have the same set of admissible
interconnection functions and they can realize any admissible
interconnection function using the same control information
(without duplication, but a proper mapping on its location is
allowed). Hence, in order to compare the functional capability
of various networks, we have to assume that those networks
have the same control structure, which includes the states of
each switching element and the way to set them. In this
section, the functional relationships among various networks
are assessed under this assumption. Under this assumption, let


the set of admissible network functions of the baseline
network, the reverse baseline network, the omega network, the
reverse omega network, the indirect binary n-cube network, the
reverse indirect binary n-cube network, the modified data
manipulator, the reverse modified data manipulator, the flip
network, the reverse flip network, the regular SW banyan
network (S=F=2) and the reverse regular SW banyan network
(S=F=2) be

    $B_N$, $B_N^{-1}$, $\Omega_N$, $\Omega_N^{-1}$, $C_N$, $C_N^{-1}$, $D_N$, $D_N^{-1}$, $F_N$, $F_N^{-1}$, $G_N$ and $G_N^{-1}$,

respectively. Using the previous notation of the binary
representation, we define the bit reversal permutation $\rho$
by

    $\rho(x_\ell x_{\ell-1} \cdots x_0) = x_0 x_1 \cdots x_\ell$,    (6)

and the bit switch permutation $\delta$ by
5 (x 9X —_ 4 Xp_ 90 + ¥g%_%X1 Xp) = XX 1 Xp_ ot ¥gkoXpX p> 
(7) 


for & 2 de 


Using the above definitions and the binary tree coding method
we can now evaluate the functional relationships. A simple
example will be demonstrated on the networks shown in Fig. 3.
The reverse-exchange network shown in Fig. 3(a) is configured
by using the binary tree coding method and has $B_N$ as its
set of admissible network functions where N=8. The omega
network shown in Fig. 3(b) of course has $\Omega_N$ as its set
of admissible network functions where N=8. Now if we take a
bit-reverse permutation, $\rho$, on the logical names of the
input links of the omega network shown in Fig. 3(b), we should
have the same logical names for the input links as those shown
in Fig. 3(a). Therefore from this example, we can see, for N=8,

    $\Omega_N = B_N \circ \rho$    (8)

where $B_N \circ \rho$ implies $B_N(\rho)$. Eq. (8) means that
the omega network function can be decomposed into two ordered
functions: first, a bit-reverse permutation and then, a
baseline network function. The argument can be extended for
any N. The functional relationships between the baseline
network and other topologically equivalent networks can be
similarly developed. The result is listed in the
$\theta = B_N$ column of Table 1. Note that the bit switch
permutation, $\delta$, is used in the $D_N$, $D_N^{-1}$, $G_N$
and $G_N^{-1}$ rows. It is interesting to see that the baseline
network and the reverse baseline network are functionally
identical, as indicated in Table 1 by

    $B_N^{-1} = B_N$.    (9)


The above results, in which we use $B_N$ as the reference, can
be manipulated so that new formulas can be formed in terms of
other network functions. For example, we can see

    $B_N = B_N \circ \rho \circ \rho$    (10)

since $\rho \circ \rho$ is equal to the identity. Then
according to Eqs. (8) and (10) we have

    $B_N = \Omega_N \circ \rho$.    (11)

This result is shown in the first entry in the
$\theta = \Omega_N$ column of Table 1. The other entries of
Table 1 are obtained in the same way. From the
$\theta = \Omega_N$ column of Table 1 we can see
$\Omega_N = C_N^{-1} = F_N^{-1}$, which means that the omega
network, the reverse indirect binary n-cube network, and the
reverse flip network are functionally identical. Similarly,
from other columns, we can also see that
$C_N = \Omega_N^{-1} = F_N$, together with the corresponding
identities relating the $D_N$ and $G_N$ networks.

The functional relationships shown in Table 1 lay the ground
for the reconfiguration problem. The reconfiguration problem
can be defined as the problem of reassigning the logical names
of the input links and/or the output links of an
interconnection network in a parallel processing system so
that the interconnection network can at least realize the
various functions of the networks listed in Table 1. For a
simple illustration, assume the baseline network of size N is
installed in a parallel processing system. In order to
reconfigure the baseline network to accomplish the various
functions listed in Table 1, two permutation functions, $\rho$
and $\delta$, are needed to reassign the logical names of
terminal links. Assume we want to realize $\Omega_N$ on the
topology of the baseline network. Then by Eq. (8) the logical
name of the input link should be
$\rho(x_\ell x_{\ell-1} \cdots x_0)$, where
$x_\ell x_{\ell-1} \cdots x_0$ is the original logical name of
the input link in the baseline network, and the logical name
of the output link should remain the same. On the other hand,
if we want to realize $G_N^{-1}$, then by the following formula

    $G_N^{-1} = \delta \circ B_N$    (12)

the logical name of the input link should remain the same, and
the logical name of the output link should be
$\delta(x_\ell x_{\ell-1} \cdots x_0)$. Summarizing the way to
reconfigure a network from Eqs. (8) and (12), we see that in
Table 1, the permutation appearing on the right hand side of
the reference network function should be used to assign the
logical name of the input link and the permutation appearing
on the left hand side should be used to assign the logical
name of the output link.
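
Eq. (8) can be checked exhaustively for N=8. The sketch below is illustrative only and rests on two modelling assumptions that are not taken from the paper's figures: the omega network is modelled as n stages of perfect shuffle followed by exchange, and the baseline network follows Eqs. (1) to (5) with switching element j serving links 2j and 2j+1. All function names are hypothetical.

```python
from itertools import product


def bit_reverse(x, n):
    """rho of Eq. (6)."""
    r = 0
    for _ in range(n):
        r = (r << 1) | (x & 1)
        x >>= 1
    return r


def baseline(x, controls, n):
    for i in range(n):
        if i > 0:                                  # link level i, Eq. (1)
            w = n - i + 1
            mask = (1 << w) - 1
            low = x & mask
            x = (x & ~mask) | (low >> 1) | ((low & 1) << (w - 1))
        x ^= controls[i][x >> 1]                   # exchange, Eq. (4)
    return x


def omega(x, controls, n):
    mask = (1 << n) - 1
    for i in range(n):
        x = ((x << 1) & mask) | (x >> (n - 1))     # perfect shuffle
        x ^= controls[i][x >> 1]                   # exchange
    return x


def admissible_set(network, n, rename_input=lambda name: name):
    """All functions realizable in one pass, over every control setting."""
    size, half = 1 << n, 1 << (n - 1)
    functions = set()
    for bits in product((0, 1), repeat=n * half):
        controls = [bits[i * half:(i + 1) * half] for i in range(n)]
        functions.add(tuple(network(rename_input(x), controls, n)
                            for x in range(size)))
    return functions


n = 3
omega_set = admissible_set(omega, n)
renamed_baseline_set = admissible_set(baseline, n, lambda x: bit_reverse(x, n))
print(omega_set == renamed_baseline_set)
```

Only the input names are touched by `rename_input`; the output names stay as they are, which is the reading of Eq. (8) given in the text.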


A reconfiguration function in the control unit of a parallel
processing system is highly recommended. The reconfiguration
function should choose a proper network configuration
according to the algorithm to be processed. It seems necessary
to have a table which establishes the mapping between the
algorithm and the network configuration, and the mapping of
the control information in the various configurations.
Furthermore, some hardware facilities may be needed for this
reconfiguration function. Several questions arise from the use
of reconfiguration: How many configurations of a network
topology are necessary to accomplish arbitrary permutations?
What is the relationship between the configuration and the
frequently used permutations [8]? Does the reconfiguration
have any impact on the multi-dimensioning access scheme
[5,19]? We will discuss these questions in Section VII on the
network utilization and limitations.


IV. Admissible Permutations

The admissible permutations, i.e., permutations which can be
realized by the reverse-exchange network in one pass, are a
subset of the $(2^n)!$ distinct permutations. In this section
we will identify these admissible permutations.


Assume that
$(A_p, Z_p) = (a_{p,\ell} a_{p,\ell-1} \cdots a_{p,0},\ z_{p,\ell} z_{p,\ell-1} \cdots z_{p,0})$
is the source-destination pair of the permutation request
$P(A_p) = Z_p$. Define
$A_{p,q} = a_{p,\ell} a_{p,\ell-1} \cdots a_{p,\ell-q+1}$ and
$Z_{p,q} = z_{p,\ell} z_{p,\ell-1} \cdots z_{p,\ell-q+1}$.


Theorem 1:
Given a set of distinct permutation requests,
$P_N = \{(A_i, Z_i) \mid 0 \le i < N\}$, $P_N$ can be realized by
the reverse-exchange network if and only if $A_j \ne A_k$ and
$A_{j,\ell-m} = A_{k,\ell-m}$ implies
$Z_{j,m+1} \ne Z_{k,m+1}$ for $j \ne k$, $0 \le j,k < N$ and
$0 \le m < \ell$.


Proof: Recall from previous work [11] that for the permutation
request $(A_j, Z_j)$, stage m will switch source $A_j$ to link
$z_{j,\ell} z_{j,\ell-1} \cdots z_{j,\ell-m}\, a_{j,\ell} a_{j,\ell-1} \cdots a_{j,m+1}$
in the level m+1. A conflict occurs if some other source is
also switched to this link. That is, for some pair of
permutation requests, say $(A_j, Z_j)$ and $(A_k, Z_k)$, and
for some m, we have

    $z_{j,\ell} \cdots z_{j,\ell-m}\, a_{j,\ell} \cdots a_{j,m+1} = z_{k,\ell} \cdots z_{k,\ell-m}\, a_{k,\ell} \cdots a_{k,m+1}$.

The conflict condition can be represented as
$A_{j,\ell-m} = A_{k,\ell-m}$ and $Z_{j,m+1} = Z_{k,m+1}$ for
permutation requests $P(A_j) = Z_j$ and $P(A_k) = Z_k$.
Therefore if $A_j \ne A_k$ and $A_{j,\ell-m} = A_{k,\ell-m}$
implies $Z_{j,m+1} \ne Z_{k,m+1}$ for $j \ne k$,
$0 \le j,k < N$ and $0 \le m < \ell$, then $(A_j, Z_j)$ and
$(A_k, Z_k)$ are realizable permutations. On the other hand,
since there is only one interconnection path existing for a
permutation, the converse of the above statement is also true.
Q.E.D.
Note that if we take Eq. (11) into account, Theorem 1 can also
be inferred from the similar arguments of Lawrie for
$\Omega_N$ [5]. Using Theorem 1 we will identify some of the
permutations which can be realized by the reverse-exchange
network in one pass.
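
The condition of Theorem 1 lends itself to a direct test. The sketch below is a hedged illustration: it assumes that $A_{p,q}$ and $Z_{p,q}$ denote the leading q bits of the source and destination names, and the function names are hypothetical. It checks the bit-reverse permutation, the identity, and one permutation of the form $X \to aX^r + b$ taken modulo N.

```python
def top_bits(x, q, n):
    """Leading q bits of the n-bit name x (A_{p,q} or Z_{p,q} in the text)."""
    return x >> (n - q)


def is_admissible(dest, n):
    """dest[a] is the destination requested by source a."""
    size = 1 << n
    for m in range(n):                        # m = 0 .. l, with l = n - 1
        links = set()
        for a in range(size):
            # link used after stage m: leading m+1 destination bits,
            # followed by the leading l-m source bits
            link = (top_bits(dest[a], m + 1, n), top_bits(a, n - 1 - m, n))
            if link in links:                 # two requests need one link
                return False
            links.add(link)
    return True


def bit_reverse(x, n):
    return int(format(x, "0{}b".format(n))[::-1], 2)


n = 4
size = 1 << n
reverse = [bit_reverse(x, n) for x in range(size)]
identity = list(range(size))
affine = [(5 * bit_reverse(x, n) + 7) % size for x in range(size)]
print(is_admissible(reverse, n), is_admissible(identity, n),
      is_admissible(affine, n))
```

The case m = l is included for completeness; it only verifies that the destinations are distinct.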
Theorem 2:


Define $X^r$ to be the number whose binary representation is
the reverse binary representation of X, and define
$P_N = \{(X_i, X_i^r) \mid 0 \le i < N\}$ to be the bit-reverse
permutation. Then $P_N$ is realizable by the reverse-exchange
network.


Proof: Assume $X_{j,\ell-i} = X_{k,\ell-i}$ for
$0 \le i \le \ell-1$ where $j \ne k$. Since $X_j \ne X_k$ for
$j \ne k$, we then obtain $X^r_{j,i+1} \ne X^r_{k,i+1}$ from the
assumption. The proof immediately follows from Theorem 1.
Q.E.D.


Theorem 3:


If $P_N = \{(A_i, Z_i) \mid 0 \le i < N\}$ is realizable by the
reverse-exchange network and a is an odd integer, then
$P'_N = \{(A_i, aZ_i) \mid 0 \le i < N\}$ is also realizable by
the network.


Proof: Define Y, = aZ.. We will prove that 
= i <m< : 
ee Ae tem for j #k and O < m < & implies 
aan eae 


: = ; eee 
Since Pie yea AL em for j #k andO <m< 


implies os tl # Ze ee? hence, as 995 9-17"? 


ee init, fag A Po Macy 9 Oe eat ea 


1 1 eee 1 ° 
Z, : 4) ; for some es 0 t m and Z. : Z 
< < eee i 


representation of a where b,=l since a is odd, 


The products of ae and Y= a2 will result 


k,s 


in Y, : # yy f This result concludes that 

_ | : | P Z : . 
BY op i AL tem for 7 # k and 0 < m < & implies 
a pA ay Q.E.D. 
Theorem 4:


If $P_N = \{(A_i, Z_i) \mid 0 \le i < N\}$ is realizable by the
reverse-exchange network and b is an integer, then
$P'_N = \{(A_i, Z_i + b) \mid 0 \le i < N\}$ is also realizable
by the network.


It is obvious that 


¢ : ; 
<m < & implies that ea 


for j # k and 
—m 


Proof: 
cl ae 
eee 7 
O<m< & implies DS sey 


#Y 


and also ess 


k,m+1° Q.E.D. 
Corollary 1: The permutation defined by
$P_N = \{(X_i, aX_i^r + b) \mid 0 \le i < N$, a is an odd
integer and b is an integer$\}$ is realizable by the
reverse-exchange network.


Corollary 2: The permutation defined by
$P_N = \{(X_i, T - X_i^r) \mid 0 \le i < N$ and T is an
integer$\}$ is realizable by the reverse-exchange network.


Corollaries 1 and 2 are consequences of Theorems 2, 3 and 4.
Note that the omega network has the same properties shown by
Theorems 3 and 4 [5].


Theorem 5:


If $P_N = \{(A_i, Z_i) \mid 0 \le i < N\}$ is realizable by the
reverse-exchange network and k is an integer, then
$P'_N = \{(A_i, Z_i \oplus k) \mid 0 \le i < N$ and $\oplus$ is
the bit-by-bit EXCLUSIVE OR$\}$ is also realizable by the
network.


Proof: Define $Y_i = Z_i \oplus k$. It can be seen that
$Y_{j,m} \ne Y_{k,m}$ if $Z_{j,m} \ne Z_{k,m}$ for
$0 \le m \le \ell$. Since $P_N$ is realizable by the
reverse-exchange network, $A_{j,\ell-m} = A_{k,\ell-m}$ for
$j \ne k$ and $0 \le m < \ell$ also implies
$Y_{j,m+1} \ne Y_{k,m+1}$. Q.E.D.
Corollary 3: The permutation defined by
$P_N = \{(X_i, X_i^r \oplus k) \mid 0 \le i < N$ and k is an
integer$\}$ is realizable by the reverse-exchange network.


Corollary 3 is a consequence of Theorems 2 and 5.


Theorem 6: 


Define the following binary representations: 


Be aay er ay Aa)? 


= ere a) 
Y, = "6 ae ee es el 
and V= (Y Vs 


i,jo1""'*4,0 
Assume V = U + k mod 21. where k is an integer. 
Then the permutation defined by P,, = 1 (4 Y504 
O<i< N} is realizable by the reverse-exchange 
network, 


Proof: By the definition, if (x 


D mp m1" ee 


*Xq g)? then OE ne fe 


. ae 
joe Gane Yq, for either m < j or 


Hence X = 
p,&-m 


O<m < & implies Y 


ae for p # q and 


# 


it Q,E.D. 


Ya mt’ 


Theorem 7:


If $P_N = \{(A_i, Z_i) \mid 0 \le i < N\}$ is realizable by the
reverse-exchange network, then
$P'_N = \{(Z_i, A_i) \mid 0 \le i < N\}$ is also realizable by
the network.


Proof: Since P, is realizable by the 
reverse-exchange network, ae ea ee = 

ee , ¥ z : , 
207 bl for j #k and 0 < m < & implies 
x Zz 


By contradiction, 


ee k, 2°" 7k, 2-m" 


assume that in the case of z,. grr 
> 


joqtl 7k, 2°"° 


j < < Wie 
oP ai for j # k and 0 < q < & we can have a 
= re ‘ i = L- 
a5 2g ayy “ks tq Then there exists m q 
-~1 such that ay Q° 5 mel Ae: eae. ml and — 
= 2 This contradicts 


tile stardagat sHowa’ in the® beginning of this 


eae Hence a eet ae Bie ger, eet for 

1 < < i i oeoe ee 

j # k and 0 < q < & implies as 9 Bigg # an 2 
Q.E.D 


apa 
Corollary 4: The permutation defined by
$P_N = \{(aX_i + b, X_i^r) \mid 0 \le i < N$, a is an odd
integer and b is an integer$\}$ is realizable by the
reverse-exchange network.


Proof: According to Theorems 2 and 7, we can see that
$P'_N = \{(X_i^r, X_i) \mid 0 \le i < N\}$ is realizable by the
network. Then by Theorems 3 and 4 it is obvious that
$P''_N = \{(X_i^r, aX_i + b) \mid 0 \le i < N$, a is an odd
integer and b is an integer$\}$ is realizable. From Theorem 7
again, we see that
$P_N = \{(aX_i + b, X_i^r) \mid 0 \le i < N$, a is an odd
integer and b is an integer$\}$ is realizable. Q.E.D.


We have presented some theorems which identify the one-pass
realizable permutations of the reverse-exchange network. In
the next section we will use these theorems to classify some
especially interesting realizable permutations and explore
the relationships between a permutation and its related
control information.


V. Recursive Routing Mechanism for Admissible Permutations


The homogeneous routing procedure described previously in [11]
already provides a simple routing mechanism. It employs the
n-bit destination tag to determine the valid state of the
in-path switching elements. However, for the reasons described
previously, it is preferable to determine the control pattern
by the name of the admissible permutation. We will first
identify some classes of useful admissible permutations and
then derive recursive algorithms which determine control
patterns according to the permutation names.


It is easy to verify, using Theorems 1-7, that the following
categories of important permutations are admissible on the
reverse-exchange network.


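1. $F_k^{(n)}$ ($0 \le k < 2^n$):
   $P(X^r \oplus k) = X$ and $P(X \oplus k) = X^r$.

2. $C_{j,k}^{(n)}$ ($0 \le j,k < 2^n$, j odd):
   $P(jX^r + k) = X$ and $P(jX + k) = X^r$.

3. $S_{q,k}^{(n)}$ ($0 \le q \le n$, $0 \le k < 2^q$):
   $P(X) = (X + k - \epsilon \cdot 2^q)^r$ and
   $P(X + 2^q - k - \epsilon \cdot 2^q) = X^r$.

4. $(F_k^{(n)})^{-1}$ ($0 \le k < 2^n$):
   $P(X) = X^r \oplus k$ and $P(X^r) = X \oplus k$.

5. $(C_{j,k}^{(n)})^{-1}$ ($0 \le j,k < 2^n$, j odd):
   $P(X) = jX^r + k$ and $P(X^r) = jX + k$.

6. $(S_{q,k}^{(n)})^{-1}$ ($0 \le q \le n$, $0 \le k < 2^q$):
   $P((X + k - \epsilon \cdot 2^q)^r) = X$ and
   $P(X^r) = X + 2^q - k - \epsilon \cdot 2^q$.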

Note that Category 4 (5 and 6) is a consequence of applying
Theorem 7 to Category 1 (2 and 3). Also, $S_{q,k}^{(n)}$ of
Category 3, a direct result of Theorem 6, is a cyclic shift of
amplitude k within each segment of size $2^q$. The $\epsilon$
in $S_{q,k}^{(n)}$ can be defined as follows. Let
$X = x_\ell x_{\ell-1} \cdots x_0$ and
$k = k_{q-1} k_{q-2} \cdots k_0$. Then $\epsilon$ is equal to
the carry-out bit of summing $x_{q-1} x_{q-2} \cdots x_0$ and
$k_{q-1} k_{q-2} \cdots k_0$.

Denote by $K^{(n)}(P) = (c_{i,j})$, $0 \le i < 2^{n-1}$ and
$0 \le j \le n-1$, the control pattern, a $2^{n-1} \times n$
matrix, associated with a permutation P. Denote also by
$V^{(n)}(b)$ the $2^n$-bit vector whose components are all
equal to b. A binary tree whose root is a vector v and whose
upper subtree and lower subtree are $K_u$ and $K_l$,
respectively, is denoted by $[v; K_u, K_l]$. The cascaded
matrix whose left part and right part are L and R,
respectively, is denoted by [L;R].


Let k be a positive integer and denote, respectively, by $k'$
and $k_0$ its quotient and its remainder in the division by 2,
i.e., $k = 2k' + k_0$.


Theorem 8:


CG ys ty Elk Gn 

where eae) = [v Oe 1) Qs yi. 

Proof: Let $X = x_\ell x_{\ell-1} \cdots x_0$ and
$k = k_\ell k_{\ell-1} \cdots k_0$ where $\ell = n-1$. Then
$X \oplus k = (x_\ell \oplus k_\ell)(x_{\ell-1} \oplus k_{\ell-1}) \cdots (x_0 \oplus k_0)$.
According to the properties of the reverse-exchange network
described by Eq. (5), we have
$x_i = e_{\ell-i}(x_i \oplus k_{\ell-i})$, $0 \le i \le \ell$,
for $P(X^r \oplus k) = X$. We can see that, for
$0 \le i \le \ell$,

    $x_i = x_i \oplus k_{\ell-i}$               if $k_{\ell-i} = 0$,
    $x_i = \overline{x_i \oplus k_{\ell-i}}$    if $k_{\ell-i} = 1$.    (13)

Eq. (13), in turn, implies that the control bit of the
switching elements in stage $\ell-i$ should equal
$k_{\ell-i}$. Thus we obtain

    $K^{(n)}(F_k^{(n)}) = [V^{(n-1)}(k_0); V^{(n-1)}(k_1); \cdots; V^{(n-1)}(k_\ell)]$.    (14)

Eq. (14) is exactly what the theorem implies. The same
arguments can be made for $P(X \oplus k) = X^r$.
Q.E.D.


Example: Assume

    P = ( 0 1 2 3 4 5 6 7
          1 5 3 7 0 4 2 6 ).

P can be described by $F_4^{(3)}$: $P(X^r \oplus 4) = X$.
According to Theorem 8, we have

    $K^{(3)}(F_4^{(3)}) = [V^{(2)}(0); V^{(2)}(0); V^{(2)}(1)]$.

Hence

    $K^{(3)}(F_4^{(3)})$ =
        0 0 1
        0 0 1
        0 0 1
        0 0 1

The setting of the network is depicted in Fig. 4.
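
The control pattern of Eq. (14) can be checked by simulation. The following Python sketch is an illustration under stated assumptions rather than the authors' procedure: `route` models the baseline topology of Eqs. (1) to (5) with switching element s of a stage serving links 2s and 2s+1, and `control_F` (a hypothetical name) sets every element of stage i to bit $k_i$ of k.

```python
def bit_reverse(x, n):
    return int(format(x, "0{}b".format(n))[::-1], 2)


def route(x, controls, n):
    """Baseline topology: controls[i][s] drives switch s of stage i."""
    for i in range(n):
        if i > 0:                                   # link level i, Eq. (1)
            w = n - i + 1
            mask = (1 << w) - 1
            low = x & mask
            x = (x & ~mask) | (low >> 1) | ((low & 1) << (w - 1))
        x ^= controls[i][x >> 1]                    # exchange, Eq. (4)
    return x


def control_F(n, k):
    """Eq. (14): stage i is uniformly set to bit k_i of k."""
    half = 1 << (n - 1)
    return [[(k >> i) & 1] * half for i in range(n)]


n, k = 3, 4
controls = control_F(n, k)
table = [route(y, controls, n) for y in range(1 << n)]
ok = all(table[bit_reverse(x, n) ^ k] == x for x in range(1 << n))
print(controls, table, ok)
```

For n = 3 and k = 4 the three stage vectors are all 0, all 0 and all 1, which corresponds to the rows 0 0 1 of the example.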


Theorem 9:


Let $j = 2j' + 1$. Then

    $K^{(n)}(C_{j,k}^{(n)}) = [V^{(n-1)}(k_0);\ K^{(n-1)}(C_{j,k'}^{(n-1)}),\ K^{(n-1)}(C_{j,j'+k'+k_0}^{(n-1)})]$,

where $K^{(1)}(C_{j,k}^{(1)}) = [k]$.
Proof: Consider P(4x" +k) = We first 
define some notations: 
X = XoXo yey» 
t . 
X XoXo aX 
A ie Jpip4--+dg> 
j Jedg_y -«Sy> 
k koko 4 -+Ko> 
| oes 
k Kok e1 ky, 
Y= jX +k Yo¥e at Vo? and 
' 
Y Yo¥p-a°°'Y1° 
Note that X = 2X' + x,, Y = 2Y¥' + Yo: j= 2j' +1 
and k = 2k' + ko: Furthermore, Yo ~ Xg ® Ko and 
1 4 ryt “1 ' : 
yl=j(X')° + j Xo ik +t Xo Ko: 
According to Eq. (5), we have 
€ (Vp ey (V1) +++ 9 (Vg) = KpXo_1+++%o: (15) 


Hence by Eq. (15) the control bit of the switching elements in
stage 0 is determined by $e_0(y_0) = x_0$. Since
$y_0 = x_0 \oplus k_0$, the control vector of stage 0 is then
equal to $V^{(n-1)}(k_0)$. The next n-1 control vectors are
dependent on $Y' = y_\ell y_{\ell-1} \cdots y_1$. These
control vectors can then be partitioned into two halves: the
upper half where $x_0 = 0$, and the lower half where
$x_0 = 1$. For the upper half, $Y' = j(X')^r + k'$ since
$x_0 = 0$. The permutation for this upper half then becomes
$P(j(X')^r + k') = X'$. For the lower half,
$Y' = j(X')^r + j' + k' + k_0$ since $x_0 = 1$. The
permutation for the lower half then becomes
$P(j(X')^r + j' + k' + k_0) = X'$. The above discussion can
recursively be applied to these two permutations of the upper
half and the lower half, as shown in the theorem. Similar
arguments can be made for $P(jX + k) = X^r$. Q.E.D.


Example: Assume

    P = (  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
          10  4 15  3  9  6 12  0 11  5 14  2  8  7 13  1 ).

P can be described by $C_{5,7}^{(4)}$: $P(5X^r + 7) = X$.
According to Theorem 9, we have

    $K^{(4)}(C_{5,7}^{(4)})$ =
        1 1 1 [0]
        1 1 1 [1]
        1 1 0 [0]
        1 1 0 [0]
        1 0 1 [1]
        1 0 1 [0]
        1 0 1 [0]
        1 0 1 [1]


The setting of the network is depicted in Fig. 5.
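
The recursion of Theorem 9 can be written out and checked against the example. The sketch below is a hedged reading of the theorem, not the authors' code: `control_C` and `route` are hypothetical names, arithmetic is taken modulo the subnetwork size at each level, and switching element s of a stage is assumed to serve links 2s and 2s+1, so that the upper subtree occupies the first half of every later stage.

```python
def bit_reverse(x, n):
    return int(format(x, "0{}b".format(n))[::-1], 2)


def route(x, controls, n):
    """Baseline topology: controls[i][s] drives switch s of stage i."""
    for i in range(n):
        if i > 0:                                   # link level i, Eq. (1)
            w = n - i + 1
            mask = (1 << w) - 1
            low = x & mask
            x = (x & ~mask) | (low >> 1) | ((low & 1) << (w - 1))
        x ^= controls[i][x >> 1]                    # exchange, Eq. (4)
    return x


def control_C(n, j, k):
    """Per-stage control vectors for P(j * X^r + k) = X, with j odd."""
    size = 1 << n
    j %= size
    k %= size
    if n == 1:
        return [[k]]                                # base case K(C) = [k]
    k0, k_rest, j_rest = k & 1, k >> 1, j >> 1      # k = 2k'+k0, j = 2j'+1
    upper = control_C(n - 1, j, k_rest)
    lower = control_C(n - 1, j, j_rest + k_rest + k0)
    stage0 = [k0] * (size // 2)                     # V^(n-1)(k_0)
    return [stage0] + [u + l for u, l in zip(upper, lower)]


n, j, k = 4, 5, 7
controls = control_C(n, j, k)
table = [route(y, controls, n) for y in range(1 << n)]
ok = all(table[(j * bit_reverse(x, n) + k) % (1 << n)] == x
         for x in range(1 << n))
print(controls)
print(table, ok)
```

Reading `controls` column by column gives the four columns of the matrix in the example, and `table` lists P(0) through P(15).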


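Theorem 10:


For $q \ge 1$ and $0 \le k < 2^q$,

    $K^{(n)}(S_{q,k}^{(n)}) = [V^{(n-1)}(k_0);\ K^{(n-1)}(S_{q-1,k'+k_0}^{(n-1)}),\ K^{(n-1)}(S_{q-1,k'}^{(n-1)})]$,

and for q = 0, $K^{(n)}(S_{0,k}^{(n)}) = K^{(n)}(C_{1,0}^{(n)})$,
where $K^{(1)}(S_{1,k}^{(1)}) = [k]$.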


Proof: Consider $P(X) = (X + k - \epsilon \cdot 2^q)^r$ here.
According to the value of $\epsilon$ defined previously, we
can see that the permutation functions like
$P(X) = (X + k)^r$ for stage i where $i < q$ and
$P(X) = X^r$ for stage j where $j \ge q$. Let $\epsilon_m$ be
the carry out of the sum of $x_m$ and $k_m$ for
$0 \le m \le i$, where $\epsilon_{-1} = 0$. For stage i,
$i < q$, we have

    $e_i(x_i) = x_i \oplus k_i \oplus \epsilon_{i-1}$.    (16)

From Eq. (16) we have $e_0(x_0) = x_0 \oplus k_0$, which
implies $V^{(n-1)}(k_0)$ for stage 0. Similarly to the proof
of Theorem 9, we partition the next (n-1) control vectors into
two halves: the upper half where $x_0 \oplus k_0 = 0$ and the
lower half where $x_0 \oplus k_0 = 1$. In the upper half, if
$k_0 = 1$ then $x_0 = 1$, and if $k_0 = 0$ then $x_0 = 0$.
Therefore $\epsilon_0 = k_0$ for the upper half. Thus the
control vectors for the upper half can be described by
$K^{(n-1)}(S_{q-1,k'+k_0}^{(n-1)})$. Since $\epsilon_0 = 0$ for
the lower half, its control vectors can be described by
$K^{(n-1)}(S_{q-1,k'}^{(n-1)})$. Concluding the above arguments
for stage i where $i < q$, we have

    $K^{(n)}(S_{q,k}^{(n)}) = [V^{(n-1)}(k_0);\ K^{(n-1)}(S_{q-1,k'+k_0}^{(n-1)}),\ K^{(n-1)}(S_{q-1,k'}^{(n-1)})]$.

On the other hand, for stage j where $j \ge q$, we can obtain
the control vectors from Theorem 9 since $P(X) = X^r$ is a
case considered in Theorem 9 where j = 1 and k = 0. Thus
$K^{(n)}(S_{0,k}^{(n)}) = K^{(n)}(C_{1,0}^{(n)})$. Similar
arguments can be made for
$P(X + 2^q - k - \epsilon \cdot 2^q) = X^r$.
Q.E.D.


Example: Assume

    P = (  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
          12  0  8  4 14  2 10  6 13  1  9  5 15  3 11  7 ).

P can be described by $S_{2,3}^{(4)}$, which is a cyclic shift
of amplitude 3 within each segment of size $2^2$ as shown in
Fig. 6. The application of the algorithm shown in Theorem 10
results in

    $K^{(4)}(S_{2,3}^{(4)})$ =
        1 0 0 [0]
        1 0 0 [0]
        1 0 0 [0]
        1 0 0 [0]
        1 1 0 [0]
        1 1 0 [0]
        1 1 0 [0]
        1 1 0 [0]


The setting of the network is depicted in Fig. 7.
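
The recursion described in the proof of Theorem 10 can be exercised on the same example. This is a sketch under assumptions, with hypothetical names: the shift amount is reduced modulo the segment size at every level (a detail the text does not state), the case q = 0 is taken to be the all-direct setting, and switching element s of a stage is assumed to serve links 2s and 2s+1.

```python
def bit_reverse(x, n):
    return int(format(x, "0{}b".format(n))[::-1], 2)


def route(x, controls, n):
    """Baseline topology: controls[i][s] drives switch s of stage i."""
    for i in range(n):
        if i > 0:                                   # link level i, Eq. (1)
            w = n - i + 1
            mask = (1 << w) - 1
            low = x & mask
            x = (x & ~mask) | (low >> 1) | ((low & 1) << (w - 1))
        x ^= controls[i][x >> 1]                    # exchange, Eq. (4)
    return x


def control_S(n, q, k):
    """Control vectors for a cyclic shift by k inside segments of size 2^q,
    followed by the bit reversal performed by the links."""
    size = 1 << n
    if q == 0:                                      # nothing left to shift
        return [[0] * (size // 2) for _ in range(n)]
    k %= 1 << q                                     # assumed reduction
    k0, k_rest = k & 1, k >> 1
    if n == 1:
        return [[k0]]
    upper = control_S(n - 1, q - 1, k_rest + k0)    # carry equals k0 here
    lower = control_S(n - 1, q - 1, k_rest)         # no carry in this half
    return [[k0] * (size // 2)] + [u + l for u, l in zip(upper, lower)]


def segment_shift(x, q, k):
    seg = 1 << q
    return (x & ~(seg - 1)) | ((x + k) & (seg - 1))


n, q, k = 4, 2, 3
controls = control_S(n, q, k)
table = [route(x, controls, n) for x in range(1 << n)]
ok = all(table[x] == bit_reverse(segment_shift(x, q, k), n)
         for x in range(1 << n))
print(controls)
print(table, ok)
```

Only stages below q receive non-zero control bits; the remaining stages stay direct, in line with the remark in the proof that they behave like $P(X) = X^r$.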


Corollary 5: The control bit pattern of Category 4 (5 and 6)
is the same as that of Category 1 (2 and 3), only reversed in
stage order and with the bit locations within each stage
properly permuted.


Proof: Since the baseline network and the reverse baseline
network are topologically and functionally equivalent, the
reversal and the permutation of the control bit locations
follow directly from one to the other. The rule for the
location permutation of the control bit is defined in [11] and
is restated here:


r{(P 


— 


= (P,.. 
i 
Le (17) 


..-P_), is the position of a 
control bit in stage i after the stage order is 
reversed, and (P....P,P,...P 5 is the correct 


bit position aftér sormucines Q.E.D. 


oe aa P 


vege a4 724 FOr 


has aaa 
0<i 


In Eq. (17), (PoP _ 


< 


VI. Realization of Arbitrary Permutations 


In this section we will first show that all permutations can
be realized by the reverse-exchange network in two passes.
Next, we consider the routing scheme for this two-pass
structure.

A. Two-Pass Permutation

The fact that the Benes binary network can realize all
permutations between its inputs and outputs follows from the
Slepian-Duguid theorem [13]. Using this fact, we will prove
that the reverse-exchange network can realize all permutations
in two passes.


Theorem 11: 


The reverse-exchange network can realize all 
permutations in two passes. 


Proof: The theorem will be proven by showing that the
functions of the Benes binary network can be simulated by the
baseline network in two passes. An example structure for a
two-pass implementation using a baseline network is shown in
Fig. 8. The input data are fed in on Side 1. The output data
of the first pass are stored in the shift register files on
Side 2. In the second pass, the data in the register files are
fed back to the input lines on Side 1 and the final results
are again stored in the register files. The two-pass structure
is equivalent to the implementation of cascading two baseline
networks. According to the previous result we can obtain a
reverse baseline network by properly permuting the switching
elements and their related links of the baseline network. For
switching elements in
the reverse baseline network, the mapping, Y., 
from physical names, (PoP ---p,), to logical 

: -l loi 
names (bobo yee +b), is howd in Eq. (17). The 


mapping, yY,. , from logical names to physical 
names is shown in the following: 
-1] 
Ys [(boby_j+++b,),] = (bp s+ +b, bose ebo aaa 
(18) 
for $0 \le i \le \ell$. If we rearrange the switching elements
and their related links of the baseline network for the second
pass in the equivalent structure in ascending order of the
physical names, which are obtained by applying
$\gamma_i^{-1}$ of Eq. (18) to the logical names, we obtain a
structure which is formed by a baseline network and a reverse
baseline network connected end to end. An example is shown in
Fig. 9. The labellings shown in Fig. 9 are the logical names.
Now, set the switching elements in the first stage of the
reverse baseline network in the equivalent structure to the
state of the direct connection. The setting is also shown in
Fig. 9 for the example. It can be seen that the structure
shown in Fig. 9 is equivalent to a Benes binary network.
Q.E.D.
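
For the smallest non-trivial size, N=4, Theorem 11 can be confirmed by brute force. The sketch below is illustrative only: it assumes that the second pass receives output k of the first pass on input k without any renaming, and that switching element s of a stage serves links 2s and 2s+1. Larger N would call for one of the control algorithms discussed in the next subsection rather than enumeration.

```python
from itertools import permutations, product


def baseline(x, controls, n):
    for i in range(n):
        if i > 0:                                   # link level i, Eq. (1)
            w = n - i + 1
            mask = (1 << w) - 1
            low = x & mask
            x = (x & ~mask) | (low >> 1) | ((low & 1) << (w - 1))
        x ^= controls[i][x >> 1]                    # exchange, Eq. (4)
    return x


def one_pass_set(n):
    """Every permutation the network realizes in a single pass."""
    size, half = 1 << n, 1 << (n - 1)
    result = set()
    for bits in product((0, 1), repeat=n * half):
        controls = [bits[i * half:(i + 1) * half] for i in range(n)]
        result.add(tuple(baseline(x, controls, n) for x in range(size)))
    return result


n = 2
first = one_pass_set(n)
two_pass = {tuple(q[p[x]] for x in range(1 << n)) for p in first for q in first}
everything = set(permutations(range(1 << n)))
print(len(first), len(two_pass), two_pass == everything)
```

One pass gives only a subset of the permutations of four elements, while composing two passes covers all of them.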


B. Routing Scheme 

There exist several algorithms which compute the control
information for the Benes binary network. A simple algorithm
was shown previously [18]. In addition, three systematic
algorithms have been proposed by Opferman and Tsao-Wu [14],
Anderson [15], and Lenfant [8]. We shall use these three
algorithms to derive the control information for our two-pass
structure.


The control pattern computed by these three algorithms should
be properly permuted before it can be applied to the
reverse-exchange network. As shown in Theorem 11, the leftmost
n stages of the Benes binary network are in one-to-one
correspondence with the reverse-exchange network of the first
pass from left to right, and the rightmost n-1 stages of the
Benes binary network are in one-to-one correspondence with the
reverse-exchange network of the second pass from right to
left. The switching elements in the leftmost stage of the
network in the second pass are confined to the valid state of
the direct connection. Assume that signals 0 and 1 represent
the valid states for the direct and the crossed connection,
respectively. We can represent the control pattern of the
Benes network and the reverse-exchange network by the
following matrices:


1. Benes binary network 


20,0: “ga bo, n-2 
Pi. “ie Pt Pget 
B= | 
by by a by 
599 oot pened > 
2. Reverse-exchange network of the 
first pass 
bo 0 rel le POnnet 
10 PEt a ee 
i , 
By ; | Py ; or Py a 
2? | hea De > 
3. Reverse-exchange network of the 
second pass 
“Og I Egan 
ayy aes aaed 
Ry = 
a sig a 
N N 
el at 1 . 


Hence, given matrix B, we can immediately obtain matrix $R_1$
and derive matrix $R_2$ by performing the following
permutation according to Eq. (18):


oe a | (19) 


where j = ry" (k) and 1<i<n-l. 
VII. Utilization and Limitations 


Like the other blocking-type multistage interconnection
networks, the reverse-exchange network also has limitations.
It cannot do the identity permutation, +1 mod N, etc. in one
pass. But with its help we find the way to reconfigure the
network to achieve all interconnection functions (including
the identity permutation, +1 mod N, etc.) of other networks in
one pass. It also facilitates the two-pass structure which can
realize arbitrary permutations. In this section we shall look
at the impact of using the reverse-exchange network, along
with the capabilities of the reconfiguration and the two-pass
structure, in parallel processing systems.


The bit-reversal permutation is vitally important to the
computation of the fast Fourier transform. The flip network
and the shuffle-exchange network cannot realize the
bit-reverse permutation in one pass. The permutation classes
$C_{j,k}^{(n)}$ and $(C_{j,k}^{(n)})^{-1}$, which are
realizable by the reverse-exchange network in one pass,
clearly indicate that scrambled data can be aligned in
bit-reverse order and that bit-reversed data can be restored
to the original scrambled order.


In the memory to processor interconnection organization, the
multi-dimensioning access memory [5,19] can be accessed by
words, by bit slices, by byte slices, etc. A
scramble/unscramble network function is required to scramble
the data when it is stored into memory and unscramble the data
when it is read from memory. In the course of reconfiguration,
the terminal link names (i.e., the memory module names and the
processor names) are transformed by proper permutations such
as $\rho$, $\delta$ and others. However, if we modify the
scrambling and unscrambling accordingly, we can achieve the
same purpose of the multi-dimensioning access in various
configurations. Thus the reverse-exchange network is not only
good for FFT problems but also good for other problems in one
pass through the use of reconfiguration. In the processor to
processor interconnection organization, the output link is fed
back to the input link, and the reconfiguration done on
terminal links on one side automatically affects the other
side. Therefore the reconfiguration scheme does not seem
promising in the processor to processor interconnection
organization unless the implementation can remove this
restriction. However, the two-pass structure is good for both
interconnection organizations.


An array computer can be composed of large numbers of
processors for the fast realization of large problems.
However, in some circumstances, the computation should be
divided into subgroups, and each group, either identical or
heterogeneous, can be performed in a small subarray of
processors and achieve efficiency through parallelism. Hence,
it is convenient in these cases to be able to partition the
computer into various subarrays. In the two-pass structure,
the partition can be done for various sizes of subarrays. A
permutation whose size is not exactly N or a power of 2 is
allowed. However, in one pass, the partition can only be done
for equal sizes that are a power of 2. We have shown the
partition definitions, $S_{q,k}^{(n)}$ and its inverse, for
the reverse-exchange network. The partition of some other
networks can be found in [17].


There is still more work to be done for implementing the
reconfiguration and the two-pass structure. It is interesting
to know how many configurations are needed to realize
arbitrary permutations in one pass. It also seems important to
develop computation algorithms in terms of the configurations
of a network. To realize a permutation required in an
algorithm we should know how to name the terminal links and
how to control the switching elements. Some algorithms may
need two or more permutations. A mechanism to work out
compromise configurations should be developed. If we cannot
work out a general way, at least we should try to specify the
relationships between frequently used permutations and
configurations of a network. On the other hand, the routing
scheme used in the two-pass structure relies on the control
algorithms for the Benes binary network. The looping
algorithms need memory storage for computing the control
pattern, and the computing time needed is in the order of
$(N/2) \log_2 (N/2)$. Lenfant's algorithm [8] overcomes those
deficiencies. However, it is restricted to some frequently
used permutation families. It is worthwhile to extend his
control mechanism to more general cases.


VIII. Conclusion


The reverse-exchange network is a valuable
interconnection network not only because it is
well suited to some important algorithmic
processes such as FFT problems but also because
it serves as an excellent reference from which
many fruitful results can be obtained.
Previously, it was used to identify isomorphic
structures among a class of multistage
interconnection networks. In this paper, we first
derive the functional relationships among the
class of multistage interconnection networks, as
shown in Table 1, using the functional property
of the reverse-exchange network. According to
the functional relationships, we propose a
reconfiguration scheme which can enhance the
network capability. Then we specify the admissible
permutations through a set of theorems and derive
a set of recursive control algorithms to realize
some useful permutations. Using the reverse-exchange
property we also prove that the recursive
control algorithms actually work. Finally,
we prove that arbitrary permutations can be
realized on a two-pass structure of the
reverse-exchange network. The routing scheme of the
two-pass structure is also derived.


The feasibility of reconfiguration shows
that a network can accomplish
various interconnection functions of other
networks in one pass. From the study of functional
relationships we observe that we can realize
other networks on a reference network by
permuting the names of the terminal links of the
reference network. The number of permutation patterns
needed for such renaming is surprisingly
small. For example, the baseline network needs
only two patterns, p and d, to accomplish various
interconnection functions (see Table 1). The
reconfiguration scheme is especially good for the
memory to processor interconnection organization.


The two-pass structure offers considerable
flexibility for parallel processing. The structure
can realize arbitrary permutations. With this
structure the parallel processing system can be
partitioned into subsystems of arbitrary sizes.
This kind of system can be used to execute any
algorithmic process. The two-pass structure is
good not only for the memory to processor
interconnection organization but also for the
processor to processor interconnection organization.


Finally, we would like to point out that we do not
discount the merits of other types of multistage
interconnection networks. On the contrary, the
work on the various types of networks
listed in Table 1 is useful, and more work
is encouraged, since any one of these networks
can be configured on the reference network
topology. Furthermore, any network can serve as
the reference network for the reconfiguration
and the two-pass structure since they are
topologically equivalent. It is our intention to use
the reconfiguration and the two-pass structure to
remove the limitations of any one of these
networks.

Fig. 1 A configuration of a 16 X 16 baseline network.
Fig. 2 A permutation realized by the reverse-exchange network of size ae
Fig. 3 (a) A reverse-exchange network,
(b) An omega network,
Table 1 Functional Relationships Among Various Multistage
Interconnection Networks (P=f£(6)).

PARTITIONING PERMUTATION NETWORKS: THE UNDERLYING THEORY


Howard Jay Siegel 
Purdue University 
School of Electrical Engineering 
West Lafayette, IN 47907 


Abstract--The age of the microcomputer has
made feasible large-scale multiprocessor systems. 
In order to use this parallel processing power in 
the form of a flexible multiple-SIMD (MSIMD) sys- 
tem, the interconnection network must be parti- 
tionable and dynamically reconfigurable. The 
theory underlying the partitioning of MSIMD sys- 
tem permutation networks into independent subnet- 
works is explored. Conditions for determining if 
a network can be partitioned into independent 
subnetworks and the ways in which it can be par- 
titioned are presented. The use of the theory is 
demonstrated by applying it to a variety of SIMD 
networks. 

I. Introduction 

An SIMD (single instruction stream - multiple
data stream) machine [8] typically consists of a
set of N processors, N memories, an interconnection
network, and a control unit (e.g., the Illiac
IV [1,5]). The control unit broadcasts instructions
to the processing elements (PEs),
where each PE is a processor/memory pair. All
active ("turned on") PEs execute the same
instruction at the same time. Each PE executes
instructions using data taken from a memory with
which only it is associated. The interconnection
network allows interprocessor communication.
When the interconnection network connects at most
one input to each output it is also called a
permutation network.

An MSIMD (multiple-SIMD) system is a parallel
processing system which can be structured as one
or more independent SIMD machines. The original
design of Illiac IV was as an MSIMD system [1].
As the microprocessor revolution makes processors
less expensive, multimicroprocessor systems which
can operate in MSIMD mode are being proposed
[6,13-15,19,22,24-26].
The partitionability of an interconnection
network is the ability to divide the network into
independent subnetworks of different sizes
[23,27]. Each subnetwork of size N' must have
all of the interconnection capabilities of a
complete network of that same type built to be of
size N'. A partitionable network can be
characterized by any limitations on the way in which it
can be subdivided. MSIMD systems use partitionable
interconnection networks to dynamically
reconfigure the system into independent SIMD
machines of varying sizes.

This work was sponsored by the Air Force Office
of Scientific Research, Air Force Systems
Command, USAF, under Grant No. AFOSR-78-3581. The
United States Government is authorized to
reproduce and distribute reprints for Government
purposes notwithstanding any copyright notation
hereon.

The theory underlying the partitioning of 
MSIMD permutation networks into independent sub- 
networks is explored in Section VI. The inter- 
connection networks to be studied are the Cube, 
PM2I, and Shuffle-Exchange. In Sections III, IV, 
and V, these networks are defined and analyzed in 
terms of permutations. This analysis is required 
as preparation for the evaluation of partitionability
in Section VI. The next section discusses
how to view interconnections as permutations.


II. Interconnection Networks and Permutations 

Formally, an interconnection network is defined
to be a set of interconnection functions
[17]. Each interconnection function is a bijection
(a one-to-one and onto mapping) on the set
of input/output addresses, the integers from 0 to
N-1. When interconnection function f is applied,
the data at input i is moved to output f(i), for
all i simultaneously, $0 \le i < N$. Since an
interconnection function is a bijection from the set
of integers 0,1,2,...,N-1 onto that same set, it
is also a permutation. In later sections it is
assumed that N is a power of two.

A cyclic notation can be used to represent the
bijection f as a permutation. The permutation is
represented as the product of cycles, where the
cycle

$$(j_0 \;\; j_1 \;\; j_2 \;\; \ldots \;\; j_{x-1} \;\; j_x)$$

means $f(j_0) = j_1$, $f(j_1) = j_2$, ..., $f(j_{x-1}) = j_x$,
and $f(j_x) = j_0$. The length of this cycle is $x+1$.
The physical interpretation of this cycle is that
input $j_0$ is connected to output $j_1$, input $j_1$ is
connected to output $j_2$, ..., input $j_{x-1}$ is
connected to output $j_x$, and input $j_x$ is connected to
output $j_0$.


The product of cycles is the composition of
the bijections the cycles represent. If p and q
are cycles, then the product pq represents the
effect of first applying p and then applying q.
For example, if
p = (0 1) and q = (1 2),
then
pq = (0 1)(1 2) = (0 2 1),
since p maps 0 to 1 and q maps 1 to 2, pq maps 0
to 2, etc. The composition of cycles is not
commutative, e.g.,
qp = (1 2)(0 1) = (0 1 2) ≠ pq.

The product of two or more permutations is
defined similarly. For example, if
g = (0 1)(2 3) and h = (0 1 2 3),
then
gh = (0 2)(1)(3).
That is, since g maps 0 to 1 and h maps 1 to 2,
gh maps 0 to 2; since g maps 1 to 0 and h maps 0
to 1, gh maps 1 to 1; etc. In general, the
composition of permutations is not commutative. For
the example above,
hg = (1 3)(0)(2) ≠ gh.
Every permutation can be uniquely represented
as the product of disjoint cycles [10]. The
cycle structure of an interconnection function is
its unique disjoint cycles representation [18].
Cycles of length one (that is, f(j) = j) are
typically not included. For example, the cycle
structure of the function (permutation)
f(x) = x + 2 modulo 8, $0 \le x < 8$,
is
(0 2 4 6)(1 3 5 7).
The cycle structure of gh defined above is
gh = (0 2)(1)(3) = (0 2).
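
The cycle products above can be checked mechanically. The following is a minimal Python sketch, not part of the original paper; the helper names `cycles_to_map` and `cycle_structure` are hypothetical and chosen only for illustration. It assumes the left-to-right convention used in this section (in a product pq, p is applied first).

```python
def cycles_to_map(cycles, n):
    """Compose cycles left to right; return f with f[i] = image of i."""
    f = list(range(n))
    for cyc in cycles:
        # step sends each element of the cycle to its successor
        step = {a: b for a, b in zip(cyc, cyc[1:] + cyc[:1])}
        # apply the permutation built so far, then this cycle
        f = [step.get(x, x) for x in f]
    return f


def cycle_structure(f):
    """Disjoint cycles of f, with cycles of length one omitted."""
    seen, out = set(), []
    for start in range(len(f)):
        cyc, x = [], start
        while x not in seen:
            seen.add(x)
            cyc.append(x)
            x = f[x]
        if len(cyc) > 1:
            out.append(tuple(cyc))
    return out


p, q = (0, 1), (1, 2)
print(cycle_structure(cycles_to_map([p, q], 3)))   # pq
print(cycle_structure(cycles_to_map([q, p], 3)))   # qp
print(cycle_structure([(x + 2) % 8 for x in range(8)]))
```

Representing a permutation as a list of images keeps composition a one-line list comprehension, and the disjoint cycle form is recovered only when it is needed for display.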


Sections III, IV, and V use the terminology 
reviewed in this section to define the Cube, 
PM2I, and Shuffle-Exchange interconnection net- 


works. The definitions and permutation proper- 
ties discussed in these sections will then be 
used to analyze partitionability in Section VI. 


III. The Cube Interconnection Network 


The Cube interconnection network consists of
$n = \log_2 N$ interconnection functions

$$\mathrm{cube}_i(s_{n-1} \ldots s_1 s_0) = s_{n-1} \ldots s_{i+1} \bar{s}_i s_{i-1} \ldots s_0$$

where $S = s_{n-1} \ldots s_1 s_0$, $\bar{s}_i$ is the complement of
$s_i$, $0 \le S < N$, and $0 \le i < n$. For example, for
$N \ge 8$, $\mathrm{cube}_2(0) = 4$. The Cube network can be
implemented as a recirculating (single stage) network
or as a multistage network.

Figure 1 shows a general model for a recirculating
network. Conceptually, a recirculating
network may be viewed as N input selectors and N
output selectors. The way in which the input
selectors are connected to the output selectors
determines the allowable connections. Since the
network consists of only a single stage of
connections, multiple passes through the network may
be required to perform a data transfer, that is,
the data may have to recirculate through the
network several times.


For the Cube network, input selector
$s_{n-1} \ldots s_1 s_0$ is connected to output selectors
$s_{n-1} \ldots s_{i+1} \bar{s}_i s_{i-1} \ldots s_0$, $0 \le i < n$.
Output selector $t_{n-1} \ldots t_1 t_0$ gets its inputs from
input selectors
$t_{n-1} \ldots t_{i+1} \bar{t}_i t_{i-1} \ldots t_0$, $0 \le i < n$.
To execute the $\mathrm{cube}_i$ interconnection function,
input selector j selects the $\mathrm{cube}_i(j)$ output line
and output selector j selects the $\mathrm{cube}_i(j)$ input
line, for all j, $0 \le j < N$. Each recirculating
Cube interconnection function for N = 8 is shown
in Figure 2.

Figure 1: Model of a recirculating network.
"IS" is input selector, "OS" is output
selector.

Figure 2: Cube recirculating network for N = 8.
(a) $\mathrm{cube}_0$. (b) $\mathrm{cube}_1$. (c) $\mathrm{cube}_2$.

Various properties of the recirculating Cube
SIMD network are presented in
[17,18,20,21,28,29]. The CHoPP MIMD (multiple
instruction stream - multiple data stream)
machine [8] uses a type of recirculating Cube
network [23,31].


Figure 3 is a model of a multistage Cube
network. The boxes in this figure are called
interchange boxes. In general, a multistage Cube
network consists of n stages of N/2 interchange
boxes. For each interchange box, the upper and
lower outputs are labeled with the same numbers
as the upper and lower inputs, respectively.
Each interchange box can connect its lower input
to its lower output and its upper input to its
upper output (the straight state) or connect its
lower input to its upper output and its upper
input to its lower output (the exchange state). It
is assumed that each box can be controlled
individually (independent box control [21,27]).
Stage i of this multistage network can implement
the $\mathrm{cube}_i$ function, that is, connect an input
line to the output line that differs from it only
in the ith bit position.


Figure 3: Model of a multistage Cube network for
N = 8.

The STARAN SIMD network and the indirect
binary n-cube SIMD network are multistage Cube
networks, and their capabilities are discussed in
[2-4,16]. The SW-banyan (S=F=2) proposed for the
varistructured array processor is also based on
the multistage Cube topology [9,14]. Other
information about multistage Cube networks can be
found in [23,27,29,32].

In terms of permutations, the $\mathrm{cube}_i$
interconnection function can be expressed uniquely as a
product of N/2 disjoint cycles of size two by

$$\prod_{\substack{j=0 \\ i\text{th bit of } j = 0}}^{N-1} (\, j \;\; \mathrm{cube}_i(j) \,).$$

For example, for N = 8, $\mathrm{cube}_2$ is
(0 4)(1 5)(2 6)(3 7).
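
As an illustration of the definition, the sketch below (Python, not from the original paper; the function names are hypothetical) treats $\mathrm{cube}_i$ as complementing bit i of the address and lists its N/2 cycles of size two.

```python
def cube(i, s):
    """cube_i: complement bit i of the address s."""
    return s ^ (1 << i)


def cube_cycles(i, n_bits):
    """The N/2 disjoint 2-cycles of cube_i for N = 2**n_bits.

    One cycle is listed per j whose ith bit is 0, matching the
    product formula in the text.
    """
    N = 1 << n_bits
    return [(j, cube(i, j)) for j in range(N) if not (j >> i) & 1]


print(cube(2, 0))          # 4
print(cube_cycles(2, 3))   # [(0, 4), (1, 5), (2, 6), (3, 7)]
```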


Consider a recirculating Cube network (Figure
2). As stated previously, all active PEs execute
the same interconnection function (instruction)
at the same time. In order for a data transfer
to be representable as a permutation, if one PE
in a cycle is inactive, the other PE in that
cycle must be inactive also. For example, consider
$\mathrm{cube}_2$ for N = 8. If PEs 0 and 4 are inactive,
the (0 4) cycle is removed, and the $\mathrm{cube}_2$
permutation becomes
(1 5)(2 6)(3 7).
If only PE 0 was inactive in the above example,
then (1) PE 0 would "keep" its own data (0 → 0)
and PE 4 would send it data (4 → 0), a two-to-one,
not one-to-one, transfer; and (2) PE 4 would
not receive any data, so the transfer would not
be onto. Thus, in general, for each cycle in a
permutation, either all PEs in the cycle must be
active or all PEs in the cycle must be inactive
in order for the resulting data transfer to be
representable as a permutation [18].
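
The effect of inactive PEs can be illustrated with a small Python sketch. It is an assumption-laden model, not the paper's formalism: an inactive PE is modeled as keeping its own data, as in the example above, and the names `transfer` and `is_permutation` are hypothetical.

```python
def transfer(f, active, N):
    """Destination of each PE's data.

    Active PEs send their data to f(j); inactive PEs are modeled
    as keeping their data (destination j).
    """
    return [f(j) if j in active else j for j in range(N)]


def is_permutation(dest):
    """True if every output receives exactly one data item."""
    return sorted(dest) == list(range(len(dest)))


N = 8
cube2 = lambda j: j ^ 4          # cube_2 for N = 8

whole_cycle_off = set(range(N)) - {0, 4}
only_pe0_off = set(range(N)) - {0}

print(is_permutation(transfer(cube2, whole_cycle_off, N)))  # True
print(is_permutation(transfer(cube2, only_pe0_off, N)))     # False
```

In the second case both PE 0 and PE 4 deliver data to output 0 while output 4 receives nothing, which is the two-to-one, not onto, transfer described in the text.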
In general, when function $\mathrm{cube}_i$ is executed,
the way in which the data is permuted is

$$\prod_{j=0}^{N-1} (\, j \;\; \mathrm{cube}_i(j) \,),$$

where the ith bit of j = 0 and PE j (and PE
$\mathrm{cube}_i(j)$) is active.


Consider a multistage Cube network (Figure 3).
Stage i of the network corresponds to the $\mathrm{cube}_i$
permutation if all the interchange boxes in stage
i are set to exchange. For example, if all
interchange boxes in the network are set to
exchange, the way in which the data is permuted is

$$(\mathrm{cube}_0)(\mathrm{cube}_1)(\mathrm{cube}_2) \ldots (\mathrm{cube}_{n-1}) = \prod_{i=0}^{n-1} \Bigl( \prod_{\substack{j=0 \\ i\text{th bit of } j = 0}}^{N-1} (\, j \;\; \mathrm{cube}_i(j) \,) \Bigr).$$

For example, for N = 4, the permutation is
(0 1)(2 3)(0 2)(1 3) = (0 3)(1 2).

In general, the way in which the data is
permuted is

$$\prod_{i=0}^{n-1} \Bigl( \prod_{j=0}^{N-1} (\, j \;\; \mathrm{cube}_i(j) \,) \Bigr),$$

where the ith bit of j = 0 and the stage i
interchange box whose inputs are labeled j and
$\mathrm{cube}_i(j)$ is set to exchange. For example, if in
Figure 3 only the top row of boxes is set to
exchange (and the rest set to straight), the
permutation is

(0 1)(0 2)(0 4) = (0 1 2 4).
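
The two examples above can be reproduced with a short simulation. This Python sketch is illustrative only: it assumes data passes through stage 0 first (as in Figure 3), identifies each interchange box by the label of its upper input (the label whose ith bit is 0), and uses the hypothetical name `multistage_cube`.

```python
def multistage_cube(n_bits, exchange):
    """Simulate a multistage Cube network with independent box control.

    exchange[i] is the set of stage-i boxes set to exchange, each
    box named by the input label j whose ith bit is 0.
    Returns dest, where dest[src] is the output reached by input src.
    """
    N = 1 << n_bits
    dest = list(range(N))
    for i in range(n_bits):            # stage 0 is traversed first
        bit = 1 << i
        # a line is exchanged iff the box holding it is in exchange[i]
        dest = [line ^ bit if (line & ~bit) in exchange[i] else line
                for line in dest]
    return dest


# Only the top row of boxes set to exchange, N = 8.
print(multistage_cube(3, [{0}, {0}, {0}]))
# All boxes set to exchange, N = 4.
print(multistage_cube(2, [{0, 2}, {0, 1}]))
```

The first call sends 0 to 1, 1 to 2, 2 to 4, and 4 to 0, i.e. the cycle (0 1 2 4); the second yields (0 3)(1 2).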

This section discussed how to describe the actions
of the recirculating Cube network and a
multistage Cube with independent box control in 
terms of permutations. These descriptions will 
be used in Section VI to analyze how these net- 
works can be partitioned into independent subnet- 
works. 


IV. The PM2I Interconnection Network 


The Plus-Minus $2^i$ (PM2I) interconnection
network consists of 2n interconnection functions

$$\mathrm{PM2}_{+i}(j) = j + 2^i \text{ modulo } N,$$

$$\mathrm{PM2}_{-i}(j) = j - 2^i \text{ modulo } N,$$

where $0 \le j < N$, $0 \le i < n$, $n = \log_2 N$, and j-x
modulo N = j+(N-x) modulo N. For example, for
$N \ge 8$, $\mathrm{PM2}_{+2}(0) = 4$. Note that
$\mathrm{PM2}_{+(n-1)} = \mathrm{PM2}_{-(n-1)}$. The
PM2I network can be implemented as a recirculating
(single stage) network or as a multistage network.

Consider the model of recirculating networks
shown in Figure 1. For the PM2I network, input
selector j is connected to output selectors
$j+2^i$ and $j-2^i$, $0 \le i < n$.
Output selector j gets its inputs from input
selectors
$j-2^i$ and $j+2^i$, $0 \le i < n$.
To execute the $\mathrm{PM2}_{+i}$ interconnection function,
input selector j selects the $\mathrm{PM2}_{+i}(j)$ output line
and output selector j selects the $\mathrm{PM2}_{-i}(j)$ input
line, for all j, $0 \le j < N$. To execute the $\mathrm{PM2}_{-i}$
interconnection function, input selector j
selects the $\mathrm{PM2}_{-i}(j)$ output line and output
selector j selects the $\mathrm{PM2}_{+i}(j)$ input line, for all j,
$0 \le j < N$. Each recirculating PM2I interconnection
function for N = 8 is shown in Figure 4.


Figure 4: PM2I recirculating network for N = 8.
(a) $\mathrm{PM2}_{+0}$. (b) $\mathrm{PM2}_{+1}$. (c) $\mathrm{PM2}_{+2}$.
For the $\mathrm{PM2}_{-i}$ connections, reverse the
directions of the arrows.

Figure 5: Augmented Data Manipulator multistage
PM2I network for N = 8. The lower
case letters represent "end-around"
connections.


Various properties of the recirculating PM2I 
network are presented in [17,18,20,21,28,29]. 

Figure 5 shows a multistage PM2I network
topology called the data manipulator network [7].
In general, the data manipulator consists of n
stages of N cells. For $0 \le j < N$ and $0 \le i < n$,
there are three sets of interconnections at stage
i: one sends the data from input cell j to
output cell $j+2^i$ modulo N ($\mathrm{PM2}_{+i}$), one sends the
data from input cell j to output cell $j-2^i$ modulo
N ($\mathrm{PM2}_{-i}$), and one sends the data from input cell
j to output cell j (straight).

The control scheme originally proposed for the
data manipulator is not flexible enough for
partitioning because sets of cells would receive the
same control signals. For this reason, the more
flexible (and more costly) augmented data
manipulator (ADM) network has been proposed [27].
In the ADM network, each cell receives its own
control signals. Specifically, for $0 \le i < n$,
each cell at stage i can get any of the signals D
("down" = $\mathrm{PM2}_{+i}$, the solid line in Figure 5), U
("up" = $\mathrm{PM2}_{-i}$, the dashed line), or H
("horizontal" = straight, the dotted line).

More information about the data manipulator
and augmented data manipulator (ADM) multistage
PM2I networks can be found in [7,21,26-29].

In terms of permutations, the $\mathrm{PM2}_{+i}$
interconnection function can be expressed uniquely as a
product of the following $2^i$ disjoint cycles of
size $2^{n-i}$ by

$$\prod_{j=0}^{2^i-1} (\, j \;\; j+2^i \;\; j+2 \cdot 2^i \;\; j+3 \cdot 2^i \;\; \ldots \;\; j+N-2^i \,).$$

For example, for N = 8, $\mathrm{PM2}_{+1}$ is

(0 2 4 6)(1 3 5 7).

The $\mathrm{PM2}_{-i}$ interconnection function can be
expressed uniquely as a product of the following $2^i$
disjoint cycles of size $2^{n-i}$ by

$$\prod_{j=0}^{2^i-1} (\, j+N-2^i \;\; \ldots \;\; j+3 \cdot 2^i \;\; j+2 \cdot 2^i \;\; j+2^i \;\; j \,).$$

For example, for N = 8, $\mathrm{PM2}_{-1}$ is

(6 4 2 0)(7 5 3 1).
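
A brief Python sketch of the PM2I functions and their cycle structure follows. It is illustrative only; the names `pm2i` and `pm2i_cycles` and the `sign` parameter (+1 for $\mathrm{PM2}_{+i}$, -1 for $\mathrm{PM2}_{-i}$) are assumptions introduced here, not notation from the paper.

```python
def pm2i(sign, i, j, N):
    """PM2_{+i}(j) for sign = +1, PM2_{-i}(j) for sign = -1."""
    return (j + sign * (1 << i)) % N


def pm2i_cycles(sign, i, n_bits):
    """The 2**i disjoint cycles of size 2**(n_bits - i).

    Cycles are written in the order used in the text: starting at j
    for PM2_{+i}, and at j + N - 2**i for PM2_{-i}.
    """
    N = 1 << n_bits
    cycles = []
    for j in range(1 << i):
        x = j if sign > 0 else (j - (1 << i)) % N
        cyc = []
        for _ in range(1 << (n_bits - i)):
            cyc.append(x)
            x = pm2i(sign, i, x, N)
        cycles.append(tuple(cyc))
    return cycles


print(pm2i_cycles(+1, 1, 3))   # [(0, 2, 4, 6), (1, 3, 5, 7)]
print(pm2i_cycles(-1, 1, 3))   # [(6, 4, 2, 0), (7, 5, 3, 1)]
```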

Consider a recirculating PM2I network (Figure
4). As with the Cube network, if one PE in a
cycle is inactive, the other PEs in that cycle must
also be inactive if the data transfer is to be
representable as a permutation. For example,
consider $\mathrm{PM2}_{+1}$ for N = 8. If PEs 0, 2, 4, and 6
are inactive, the (0 2 4 6) cycle is removed
and the permutation becomes (1 3 5 7).
In general, when the interconnection function
$\mathrm{PM2}_{+i}$ is executed, the way in which the data is
permuted is

$$\prod_{j=0}^{2^i-1} (\, j \;\; j+2^i \;\; j+2 \cdot 2^i \;\; j+3 \cdot 2^i \;\; \ldots \;\; j+N-2^i \,),$$

where for each j PEs $j+k \cdot 2^i$, $0 \le k \le (2^{n-i}-1)$,
are all active. When the interconnection
function $\mathrm{PM2}_{-i}$ is executed, the way in which the data is
permuted is

$$\prod_{j=0}^{2^i-1} (\, j+N-2^i \;\; \ldots \;\; j+3 \cdot 2^i \;\; j+2 \cdot 2^i \;\; j+2^i \;\; j \,),$$

where for each j PEs $j+k \cdot 2^i$, $0 \le k \le (2^{n-i}-1)$,
are all active.

Consider the ADM network (Figure 5). In order
for the entire data transfer (from the input of
the network to the output of the network) to be
representable as a permutation, no data can be
destroyed at any stage. This implies that, for
$0 \le i < n$, the transfer of data from the input
cells of stage i to the output cells of stage i
must be representable as a permutation, that is,
each input cell must be connected to exactly one
output cell. In general, at stage i, the flexible
control scheme allows some cells to execute
$\mathrm{PM2}_{+i}$, while others execute $\mathrm{PM2}_{-i}$, while still
others execute "straight." With the recirculating
structure this was not allowed, i.e., either all
active PEs executed $\mathrm{PM2}_{+i}$ or all active PEs
executed $\mathrm{PM2}_{-i}$ (inactive PEs being equivalent to the
"straight" state for the multistage network).
The following lemma examines how this increased
flexibility affects the set of permutations
performable by the ADM network. This lemma will be
used to analyze the partitionability of the ADM
network in Section VI.


Lemma: If all data transfers are representable
as permutations, then in the ith stage of the ADM
network, $0 \le i < n$, the transfer of data from
input cell j can be represented only as any one of
the following five cycles:

1. $(\, j \,)$,

2. $(\, j \;\; j+2^i \;\; j+2 \cdot 2^i \;\; j+3 \cdot 2^i \;\; \ldots \;\; j+N-2^i \,)$,

3. $(\, j+N-2^i \;\; \ldots \;\; j+3 \cdot 2^i \;\; j+2 \cdot 2^i \;\; j+2^i \;\; j \,)$,

4. $(\, j \;\; j+2^i \,)$, or

5. $(\, j \;\; j-2^i \,)$,

where all arithmetic is modulo N.


Proof: There are three cases for the ith stage:
input cell j connects to output cell j, $j+2^i$, or
$j-2^i$.

Case 1: Input cell j connects to output cell j.
This is form 1.

Case 2: Input cell j connects to output cell
$j+2^i$. Since the data transfer must be representable
as a permutation, input cell $j+2^i$ can not be
connected to output cell $j+2^i$, so it must be
connected to either output cell (a) $j+2^i-2^i = j$ or
(b) $j+2^i+2^i$. Subcase (a) is, obviously, form 4.
Subcase (b) is form 2 because for k = 2,3,4,...,
(in that order) input cell $j+k \cdot 2^i$ can not be
connected to either output cell $j+k \cdot 2^i$ or
$j+(k-1) \cdot 2^i$, since they are already connected to
input cells. Therefore, for this subcase, $j+k \cdot 2^i$
must be connected to output cell $j+(k+1) \cdot 2^i$,
$0 \le k < 2^{n-i}$.

Case 3: Input cell j connects to output cell
$j-2^i$. Using arguments similar to those in Case
2, it can be shown that this case must generate a
cycle of either form 3 or 5 (recall that $j-k \cdot 2^i$
$= j+N-k \cdot 2^i = j+(2^{n-i}-k) \cdot 2^i$ modulo N). For
i = n-1, forms 2 and 3 are the same and forms 4
and 5 are the same, since $j+2^{n-1} = j-2^{n-1}$ modulo
N.


Thus, a permutation is performable at stage i
if and only if it can be represented as the
product of disjoint cycles of the forms 1 to 5 given
above. For example, for N = 8, at stage 1, the
permutation

(0 2 4 6)(1 3)(5 7)

is performable. If $\mathrm{perm}_i$ represents the
permutation performed at stage i of the ADM network,
then the permutation performed by the entire
network is

$$\prod_{i=n-1}^{0} \mathrm{perm}_i.$$

Note that i goes from n-1 to 0 because data
travels from stage n-1 to stage n-2 to stage n-3,
etc.
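
For small N the lemma can be checked by exhaustive enumeration. The Python sketch below is only a sanity check under stated assumptions, not a substitute for the proof: every cell of a stage independently chooses one of the three connections, only one-to-one transfers are kept, and the helper names are hypothetical.

```python
from itertools import product


def stage_permutations(i, n_bits):
    """All one-to-one transfers of ADM stage i (each cell picks
    -2**i, straight, or +2**i), as tuples dest[j]."""
    N, d = 1 << n_bits, 1 << i
    perms = set()
    for moves in product((-d, 0, d), repeat=N):
        dest = tuple((j + mv) % N for j, mv in enumerate(moves))
        if len(set(dest)) == N:        # no data destroyed
            perms.add(dest)
    return perms


def cycles_of(dest):
    """Disjoint cycles of dest, including cycles of length one."""
    seen, out = set(), []
    for s in range(len(dest)):
        cyc, x = [], s
        while x not in seen:
            seen.add(x)
            cyc.append(x)
            x = dest[x]
        if cyc:
            out.append(cyc)
    return out


def has_lemma_form(cyc, d, N):
    """True if the cycle matches one of forms 1 to 5 for step d = 2**i."""
    L = len(cyc)
    if L == 1:
        return True                                 # form 1
    steps = {(cyc[(k + 1) % L] - cyc[k]) % N for k in range(L)}
    if L == 2:
        return steps <= {d, N - d}                  # forms 4 and 5
    return L == N // d and steps in ({d}, {N - d})  # forms 2 and 3


n_bits, N = 3, 8
ok = all(has_lemma_form(c, 1 << i, N)
         for i in range(n_bits)
         for perm in stage_permutations(i, n_bits)
         for c in cycles_of(perm))
print(ok)
```

The enumeration visits $3^N$ settings per stage, so it is practical only for small N such as 8.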


This section discussed how to describe the
actions of the recirculating PM2I network and
multistage ADM network in terms of permutations.
These descriptions will be used in Section VI to
analyze how these networks can be partitioned
into independent subnetworks.


V. The Shuffle-Exchange Network


The Shuffle-Exchange interconnection network
consists of two interconnection functions, the
shuffle and the exchange.

$$\mathrm{shuffle}(s_{n-1} \ldots s_1 s_0) = s_{n-2} \ldots s_1 s_0 s_{n-1}$$

where $S = s_{n-1} \ldots s_1 s_0$, $0 \le S < N$, and $n = \log_2 N$.
This will be referred to as a shuffle function
based on N elements. For example, for $N \ge 4$,
shuffle(1) = 2.

$$\mathrm{exchange}(S) = \mathrm{cube}_0(S)$$

where $0 \le S < N$. For example, for $N \ge 2$,
exchange(1) = 0. The Shuffle-Exchange can be
implemented as a recirculating (single stage) or as
a multistage network.

Consider the model of recirculating networks
shown in Figure 1. For the Shuffle-Exchange
network, input selector $s_{n-1} \ldots s_1 s_0$ is connected
to output selectors
$s_{n-2} \ldots s_1 s_0 s_{n-1}$ and $s_{n-1} \ldots s_1 \bar{s}_0$.
Output selector $t_{n-1} \ldots t_1 t_0$ gets its inputs
from input selectors
$t_0 t_{n-1} \ldots t_2 t_1$ and $t_{n-1} \ldots t_1 \bar{t}_0$.
To execute the shuffle interconnection function,
input selector $s_{n-1} \ldots s_1 s_0$ selects the
$s_{n-2} \ldots s_1 s_0 s_{n-1}$ output line and output selector
$t_{n-1} \ldots t_1 t_0$ selects the $t_0 t_{n-1} \ldots t_2 t_1$ input line.
To execute the exchange interconnection function,
input selector $s_{n-1} \ldots s_1 s_0$ selects the
$s_{n-1} \ldots s_1 \bar{s}_0$ output line and output selector
$t_{n-1} \ldots t_1 t_0$ selects the $t_{n-1} \ldots t_1 \bar{t}_0$ input line.
The shuffle and exchange interconnection
functions for N = 8 are shown in Figure 6.

Figure 6: Shuffle-exchange recirculating network
for N = 8. Solid line is exchange,
dashed line is shuffle.

The use of a recirculating Shuffle-Exchange
network for parallel processing was first
proposed in [30]. Various properties of this
network are discussed in [17,18,20,21,28,30].


Figure 7: Model of a multistage Shuffle-Exchange
network for N = 8.


Figure 7 is a model of a multistage
Shuffle-Exchange network. Like the multistage Cube
network model, the multistage Shuffle-Exchange
network consists of n stages. Each stage of a
Shuffle-Exchange network consists of the shuffle
interconnection (connecting the line at position
x to position shuffle(x), $0 \le x < N$) followed by
a column of N/2 interchange boxes. Recall that
the upper and lower outputs of the interchange
boxes are labeled with the same numbers as the
upper and lower inputs, respectively. It is
assumed that each interchange box is controlled
independently and may be in either the straight or
exchange state.

Various properties of multistage
Shuffle-Exchange networks are described in [11,12,21,27].
(The interchange boxes of the "omega" multistage
Shuffle-Exchange network [12] can be in
"broadcast" states in addition to the straight and
exchange states, but here only the latter states
are considered.)

In terms of a permutation, the exchange
interconnection function can be expressed uniquely as
a product of N/2 disjoint cycles of size two by

$$\prod_{\substack{j=0 \\ j \text{ even}}}^{N-1} (\, j \;\; j+1 \,).$$

For example, for N = 8 the exchange is
(0 1)(2 3)(4 5)(6 7).

Let "$\mathrm{shuffle}^i$" mean apply the shuffle function
i times. Then, in terms of a permutation, the
shuffle interconnection function can be expressed
uniquely as a product of disjoint cycles by

$$\prod_{\substack{j=0 \\ j \text{ not in a previous cycle}}}^{N-1} (\, j \;\; \mathrm{shuffle}(j) \;\; \mathrm{shuffle}^2(j) \;\; \ldots \,).$$

For example, for N = 8 the shuffle is
(0)(1 2 4)(3 6 5)(7)
= (1 2 4)(3 6 5).
In general, for a shuffle based on N elements,
the sizes of the cycles in the product of
disjoint cycles representation of the shuffle will
vary. However, the largest a cycle can be is n,
since $\mathrm{shuffle}^n(x) = x$, $0 \le x < N$.
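
The shuffle and exchange functions and the shuffle's cycle structure can be sketched in Python as follows. The function names are hypothetical and the code is only an illustration of the definitions above: the shuffle is a one-bit left rotation of the n-bit address, and the exchange complements bit 0.

```python
def shuffle(s, n_bits):
    """Rotate the n-bit address s left by one bit."""
    N = 1 << n_bits
    return ((s << 1) | (s >> (n_bits - 1))) & (N - 1)


def exchange(s):
    """exchange(S) = cube_0(S): complement bit 0."""
    return s ^ 1


def shuffle_cycles(n_bits):
    """Disjoint cycles of the shuffle, cycles of length one omitted."""
    N = 1 << n_bits
    seen, out = set(), []
    for j in range(N):
        cyc, x = [], j
        while x not in seen:
            seen.add(x)
            cyc.append(x)
            x = shuffle(x, n_bits)
        if len(cyc) > 1:
            out.append(tuple(cyc))
    return out


print(shuffle(1, 2), exchange(1))   # 2 0
print(shuffle_cycles(3))            # [(1, 2, 4), (3, 6, 5)]
```

Because n successive rotations return every n-bit address to itself, no cycle found this way is longer than n.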

Consider a recirculating Shuffle-Exchange
network (Figure 6). Recall that if one PE in a
cycle is inactive, the other PEs in that cycle must
also be inactive if the data transfer is to be
representable as a permutation. For example,
consider the shuffle function for N = 8. If PEs
1, 2, and 4 are inactive, the (1 2 4) cycle is
removed and the permutation becomes (3 6 5).
In general, when the shuffle interconnection
function is executed, the way in which data is
permuted is

$$\prod_{j=0}^{N-1} (\, j \;\; \mathrm{shuffle}(j) \;\; \mathrm{shuffle}^2(j) \;\; \ldots \,),$$

where, for each cycle, j has not appeared in a
previous cycle and PEs $\mathrm{shuffle}^i(j)$, $0 \le i < n$,
are all active.

The permutation analysis for the exchange
interconnection function of a recirculating
Shuffle-Exchange network is the same as the
analysis for $\mathrm{cube}_0$ presented in Section III.


To describe the permutations performable by
multistage Shuffle-Exchange networks, their
relationship to the Generalized Cube network is
examined. The Generalized Cube network [27] is
identical to the multistage Cube network shown in
Figure 3, except the data travels in the opposite
direction, that is, through stage n-1, then stage
n-2, then stage n-3, etc. Thus, the permutations
performable by the Generalized Cube network can
be expressed as

$$\prod_{i=n-1}^{0} \Bigl( \prod_{j=0}^{N-1} (\, j \;\; \mathrm{cube}_i(j) \,) \Bigr),$$

where the ith bit of j = 0 and the stage i
interchange box whose inputs are labeled j and
$\mathrm{cube}_i(j)$ is set to exchange. Notice that the
outer product goes from i = n-1 to n-2 to n-3,
etc., due to the order in which the stages of the
network are traversed.

In [27] it is shown that the structure and
connection capabilities of multistage Shuffle-Exchange
networks are the same as those of the Generalized
Cube network. At stage i of both networks the
labels of inputs to interchange boxes differ only
in their ith bit position. This shows the
relationship between the Shuffle-Exchange and Cube
multistage networks. In particular, the
interchange boxes in stage i of multistage
Shuffle-Exchange networks implement the cycles of the
$\mathrm{cube}_i$ interconnection function. Therefore, the
expression above for the permutations performable
by the Generalized Cube network also describes
the permutations performable by multistage
Shuffle-Exchange networks.

This section discussed how to describe the ac- 
tions of recirculating and multistage Shuffle- 
Exchange networks in terms of permutations. The 
description of the recirculating Shuffle-Exchange 
network will be used in Section VI to show that 
it can not be partitioned. The description of 
multistage Shuffle-Exchange networks will be used 
to show how to partition these networks into in- 
dependent subnetworks. 


VI. Partitioning 
A. Definitions 


To analyze formally the partitioning of
interconnection networks, the following definitions
are introduced:

(a) $P = \{0, 1, 2, \ldots, N-1\}$, the set of physical
PE addresses;

(b) $l_i = \{l_{i0}, l_{i1}, l_{i2}, \ldots, l_{i(w_i-1)}\}$, the set
of logical PE addresses in the ith partition;

(c) $w_i$ is the size of $l_i$ (that is, $|l_i| = w_i$),
where $0 < w_i \le N$ and $w_i$ is a power of two;

(d) v is the number of partitions, where
$0 < v \le N$;

(e) $L = \bigcup_{i=0}^{v-1} l_i$, the
set of logical PE addresses for all partitions,
where $|L| = \sum_{i=0}^{v-1} w_i = N$;

(f) m is a bijection from P to L, such that if
$m(p) = l_{ij}$ then $m^{-1}(l_{ij}) = p$, where $p \in P$
and $l_{ij} \in l_i$.

The physical interconnection network of a
system is defined in terms of P.

The $\mathrm{cube}_i$ interconnection function causes the
PE whose logical address is x to send its data to
the PE whose logical address is y if and only if
$\mathrm{cube}_i(m^{-1}(x)) = m^{-1}(y)$. In order to partition
the Cube network into independent subnetworks,
the mapping m must have certain properties. For
$0 \le i < v$, the ith partition, $l_i$, must be such
that $\log_2 w_i$ Cube interconnection functions are
available for its independent use. Furthermore,
$l_{ij}$, $0 \le j < w_i$, must be connected to each of the
$\log_2 w_i$ PEs in $l_i$ whose logical addresses differ
from j in only one bit position.


Theorem 1: In terms of the cycle structure of the
Cube interconnection functions, the network will
be partitioned into independent subnetworks if
and only if m is such that $\forall i$, $0 \le i < v$, for
each of $\log_2 w_i$ distinct Cube functions exactly
$w_i/2$ of the cycles contain only elements of P
which are mapped to elements of $l_i$ by m. In
addition, for $0 \le r < \log_2 w_i$, if
$\mathrm{cube}_{i_r}(m^{-1}(l_{ij})) = m^{-1}(l_{ik})$, then j and k can
differ in only the rth bit position, $\forall j,k$,
$0 \le j,k < w_i$.


Proof: Since the cycles in the cycle structure
are disjoint, if exactly $w_i/2$ of the size two
cycles contain only elements of P which are mapped
by m to elements of $l_i$, all of the elements of $l_i$
are included, and only elements of $l_i$ are
included. Thus, because the cycles are disjoint, no
element which maps to an element of $l_i$ is in a
cycle other than one of these $w_i/2$. Therefore,
the collection of the $w_i/2$ cycles in each of
$\log_2 w_i$ Cube functions constitutes a complete and
independent Cube network for $l_i$. The constraint
that, for $0 \le r < \log_2 w_i$ and
$\mathrm{cube}_{i_r}(m^{-1}(l_{ij})) = m^{-1}(l_{ik})$, j and k can differ
only in the rth bit position, $\forall j,k$, $0 \le j,k < w_i$,
establishes a correspondence between the physical
Cube connections and the logical connections for
a partition, maintaining the properties of the
Cube network. This constraint requires the
$\mathrm{cube}_{i_r}$ interconnection function to connect PEs
whose logical addresses differ only in the rth
bit position. Without this constraint, m would
not preserve the properties of the Cube network
(e.g., $m(0) = l_{i0}$, $m(1) = l_{i3}$, $m(2) = l_{i1}$,
$m(3) = l_{i2}$ would incorrectly be allowed for a
partition of size four).
For example, if $N = 8$, $v = 3$, $w_0 = 4$, $w_1 = 2$,
and $w_2 = 2$, then one possible choice for $m$ is
$m(0) = l_{00}$, $m(1) = l_{10}$, $m(2) = l_{01}$, $m(3) = l_{20}$,
$m(4) = l_{02}$, $m(5) = l_{11}$, $m(6) = l_{03}$, $m(7) = l_{21}$.

The mapping $m$ meets the requirements by picking
the following sets of cycles: (a) for $l_0$: (0 2)(4 6)
from $\mathrm{cube}_1$, (0 4)(2 6) from $\mathrm{cube}_2$;
(b) for $l_1$: (1 5) from $\mathrm{cube}_2$; and (c) for
$l_2$: (3 7) from $\mathrm{cube}_2$. In general, the physical
addresses of all the PEs in a partition $l_i$
must agree in the $n - \log_2 w_i$ bit positions not
corresponding to the $\log_2 w_i$ Cube functions the
partition will use for communications [27].
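
The closure property in the N = 8 example can be checked mechanically. The following is a minimal sketch, assuming the usual definition of cube_i as complementing bit i of the physical address; the helper names `is_closed` and `cycles` are illustrative and do not come from the paper.

```python
def cube(i, p):
    """Cube interconnection function: complement bit i of physical address p."""
    return p ^ (1 << i)


def is_closed(partition, functions):
    """True if every listed cube function maps the partition onto itself."""
    members = set(partition)
    return all(cube(i, p) in members for i in functions for p in members)


def cycles(i, partition):
    """Size-two cycles of cube_i restricted to a partition closed under it."""
    return sorted({tuple(sorted((p, cube(i, p)))) for p in partition})


# Partitions of the N = 8 example: (physical addresses, cube functions used).
partitions = {
    0: ([0, 2, 4, 6], [1, 2]),
    1: ([1, 5], [2]),
    2: ([3, 7], [2]),
}
closed = {name: is_closed(pes, fns) for name, (pes, fns) in partitions.items()}
```

Each partition agrees in the address bits it does not use: bit 0 for the first set, bits 0 and 1 for the other two. Trying cube_0 on {0, 2, 4, 6} leaves the set, which is why that function is unavailable to the partition.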

For interprocessor communications within the
partition, only a subset of the cycles need to be
used. For example, to connect $l_{0j}$ to
$l_{0(j+1 \bmod 4)}$, $0 \le j \le 3$, use:

( 0 2 )( 4 6 )( 0 4 ) = ( 0 2 4 6 )

which, by applying $m$, is ( $l_{00}$ $l_{01}$ $l_{02}$ $l_{03}$ ).


Consider the model of a recirculating Cube
network (Figure 1). In order to have multiple
independent SIMD machines, multiple controllers
must be available. If each PE sets its own input
and output selectors, based on the transfer
instructions it receives from its controller, each
partition can perform Cube cycles independent of
the other partitions. Subsets of the cycles
available to a partition are chosen by disabling
the appropriate PEs during the data transfer.

In a multistage Cube network (Figure 2), the
cycle ( x y ), where $x, y \in P$, is implemented by
the interchange box uniquely determined by the
inputs labeled x and y. The assumption that each
interchange box is controlled individually is
needed so that the different partitions can
operate independently and concurrently.


C. Partitioning the PM2I Network


The partitionability of the recirculating PM2I
network is first examined. Following that, the
partitionability of the ADM is explored.


Theorem 2: In terms of the cycle structure of the
PM2I functions, recirculating PM2I networks will
be partitioned into independent subnetworks if
and only if $m$ is such that for $\forall i$, $0 \le i < v$, for
each of $2 \times \log_2 w_i$ PM2I functions there exist cycles
containing all of the elements of $P$ which
are mapped to elements of $l_i$ by $m$, and nothing
else. In addition, for $0 \le r < \log_2 w_i$, if

pm2,. Cm (1.29) = mw 'Cls,), then k = j#2 ", and 
aX ij ik’ ” J ? 

if PM2_. (m Cl...) =m Cl.,), then k = j-2 

=m ij ik?’ , 


Proof: Let $w_i = N/2^a$, $j \in P$, and $m(j) \in l_i$. The
PM2I functions $\mathrm{PM2}_{+x}$ and $\mathrm{PM2}_{-x}$, $0 \le x < a$, can not
be used by $j$ because their cycles are of length
$N/2^{a-1}$ or longer. If $l_i$ uses a cycle longer than
$w_i$ it will not be independent of $l_j$, $0 \le i, j < v$,
$i \ne j$. By the definition of the PM2I network, if
a partition is of size $w_i$ it must have $2 \times \log_2 w_i$
PM2I functions to use. Since $a = n - \log_2 w_i$ and
$\mathrm{PM2}_{\pm x}$, $0 \le x < a$, can not be used, this leaves
the $2 \times \log_2 w_i$ functions $\mathrm{PM2}_{\pm x}$, $a \le x < n$. It
must now be shown that for each of these functions,
there are cycles which contain all $j \in P$
such that $m(j) \in l_i$ and no $b \in P$ such that
$m(b) \notin l_i$. $\mathrm{PM2}_{\pm a}$ must be available for use by
$l_i$. Therefore, if $m(j) \in l_i$, all of the $w_i$ elements
in the cycle containing $j$ must be in $l_i$.
Thus, $l_i$ is defined to be $j + k \cdot 2^a$, $0 \le k < 2^{n-a}$,
that is, all those elements of $P$ whose low-order
"a" bits equal $j$ [27]. For $a \le x < n$, any cycle
containing one element whose low-order "a" bits
are $j$ will contain only elements whose low-order
"a" bits are $j$. Thus, the choice for $l_i$ of
$j + k \cdot 2^a$, $0 \le k < 2^{n-a}$, is the only way to provide
$2 \times \log_2 w_i$ PM2I functions that can be used independently
by members of $l_i$. The constraint stated at
the end of the Theorem statement ensures that $m$
is such that the mathematical properties of the
PM2I permutations are preserved. For example,


without the constraint, for N = Wo = 4, 
m0) = log, m1) = log, m2) = lgy, m3) = bp 
would incorrectly be allowed. Co 


Theorem 3: The ADM network can be partitioned
based on the criteria described in Theorem 2.

Proof: The Lemma in Section IV showed the five
forms of cycles needed to partition the ADM into
independent subnetworks with the properties of
the complete network. For stage i, the elements


of the cycles are a subject of j+k*2", 
O<k<2"', and, all of j+k«2', 
O<k<2™', The rest follows from Theorem 2.0) 

For example, if $N = 8$, $v = 3$, $w_0 = 4$, $w_1 = 2$,
and $w_2 = 2$, then one possible correct choice for
$m$ is $m(0) = l_{10}$, $m(1) = l_{00}$, $m(2) = l_{20}$,
$m(3) = l_{01}$, $m(4) = l_{11}$, $m(5) = l_{02}$, $m(6) = l_{21}$,
$m(7) = l_{03}$. The mapping $m$ meets the requirements
by picking the following set of cycles: (a) for
$l_0$: ( 1 3 5 7 ) from $\mathrm{PM2}_{+1}$, ( 7 5 3 1 ) from
$\mathrm{PM2}_{-1}$, ( 1 5 )( 3 7 ) from $\mathrm{PM2}_{\pm 2}$; (b) for $l_1$:
( 0 4 ) from $\mathrm{PM2}_{\pm 2}$; and (c) for $l_2$: ( 2 6 ) from
$\mathrm{PM2}_{\pm 2}$. (Recall $\mathrm{PM2}_{+2} = \mathrm{PM2}_{-2}$.) For interprocessor
communications within the partition, only a
subset of the cycles need to be used. For example,
to connect the processor pairs $l_{0j}$ and $l_{0k}$,
where $j$ and $k$ differ in the high-order bit position,
use


(1357)¢15)2=¢13)7)C¢57) 
which, by applying m, is 


© log to2 > © oq bog 


Consider the model of a recirculating PM2I 
network (Figure 1). As in the case of partition- 
ing the recirculating Cube network, there must be 
multiple control units. Subsets of the cycles 
available to a partition are chosen by disabling 
PEs during the data transfer. 
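
The cycle structure used in the PM2I example can be reproduced with a short sketch. It assumes the usual definition PM2_{+i}(j) = j + 2^i mod N and PM2_{-i}(j) = j - 2^i mod N; the function names are illustrative only.

```python
N = 8  # number of PEs in the example, so n = 3


def pm2(sign, i, p, n_pes=N):
    """PM2I function: add (sign=+1) or subtract (sign=-1) 2**i modulo n_pes."""
    return (p + sign * (1 << i)) % n_pes


def cycle_of(sign, i, start, n_pes=N):
    """The cycle of PM2_{sign i} that contains the physical address start."""
    cyc, p = [start], pm2(sign, i, start, n_pes)
    while p != start:
        cyc.append(p)
        p = pm2(sign, i, p, n_pes)
    return cyc


def partition_members(j, a, n_pes=N):
    """All physical addresses whose low-order a bits equal j: j + k * 2**a."""
    return [j + k * (1 << a) for k in range(n_pes >> a)]
```

With a = 1 and j = 1 the partition is {1, 3, 5, 7}, and the cycles of PM2_{+1}, PM2_{-1} and PM2_{+2} starting from 1 stay inside it, while the PM2_{+0} cycle has length N and visits every PE, so it cannot be used.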


In the ADM, the cycles corresponding to a
given stage can be selected by sending the appropriate
control signals (H, D, or U). Cycles
of the form "(j)" are established by the "H" connection.

D. Partitioning the Shuffle-Exchange Network 

If implemented as a recirculating network the
Shuffle-Exchange network can not be used to partition
the set of PEs into independent groups
whose sizes, $w_i$, $0 \le i < v$, are powers of two.


Theorem 4: The Shuffle-Exchange recirculating 
network can not be partitioned into independent 
subnetworks. 


Proof: To have a complete recirculating Shuffle-Exchange
network for a partition of size $w_i$, it
must first be possible to partition the set $P$
into subsets of size $w_i$, $0 \le i < v$, such that all
PEs whose physical addresses map to logical addresses
in $l_i$ have a shuffle interconnection
based on $w_i$ elements. In general, this is not
possible. The assumption that it is possible
will be made, and as a result, a contradiction
will be reached. Let $G = 00 \ldots 01 \in P$ and
$m(G) \in l_i$ for some $i$, $0 \le i < v$, where $0 < w_i < N$
and $w_i$ is a power of two. Based on the definition
of the shuffle interconnection function, the
size of the largest cycle of a shuffle function
based on $w_i$ elements is $\log_2 w_i$, where
$0 \le \log_2 w_i < n$. But $G$ will be in a cycle of size
$n$. In particular, $G$ will be in a cycle containing
the PEs whose physical addresses are
$00 \ldots 010$, $00 \ldots 0100$, ..., $10 \ldots 00$. Thus, if
$m(G) \in l_i$, then $l_i = P$ and $w_i = N$. □
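
The key step of the proof, that the address 00...01 lies on a shuffle cycle of size n, can be illustrated with a small sketch. It assumes the standard definition of the shuffle as a one-position left rotation of the n-bit address; the function names are illustrative.

```python
def shuffle(p, n):
    """Perfect shuffle: rotate the n-bit address p left by one position."""
    high = (p >> (n - 1)) & 1
    return ((p << 1) & ((1 << n) - 1)) | high


def shuffle_cycle(start, n):
    """The cycle of the shuffle permutation that contains start."""
    cyc, p = [start], shuffle(start, n)
    while p != start:
        cyc.append(p)
        p = shuffle(p, n)
    return cyc
```

For n = 3 the cycle through G = 001 is 001, 010, 100, which spans all three bit positions, so no proper subset of the PEs that agrees on some address bits can contain it.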
: 


To evaluate the partitionability of multistage
Shuffle-Exchange networks, the permutation expression
derived in Section V is used. The
results of Theorem 1 are then applied.

Theorem 5: Multistage Shuffle-Exchange networks
can be partitioned based on the criteria
described in Theorem 1.

Proof: The partitioning of the Cube network in
Theorem 1 is based upon selecting cycles of the
permutation representation of the Cube interconnection
functions. It is independent of the order
of the cycles that comprise the permutation.
Thus, Theorem 1 applies to the Generalized Cube
network. Since multistage Shuffle-Exchange networks
are equivalent to the Generalized Cube network,
Theorem 1 holds for them also. The choice
of interchange boxes to implement cycles is as
described for the Cube. □

VII. Conclusions 


A formal approach to studying the partitiona- 
bility of permutation networks was presented. 
Three types of networks were analyzed. Based on 
conditions on the mapping m, it was shown which 
networks can be partitioned and how to choose
partitions where they are possible.


LSI IMPLEMENTATION OPTIONS FOR THE SHUFFLE-EXCHANGE
NETWORK IN A MICROPROGRAMMED SIMD ARRAY


Smil Ruhman 
Department of Applied Mathematics 
The Weizmann Institute of Science 
Rehovot, Israel 


A promising approach to the interconnection 
of SIMD processor arrays combines the shuffle- 
exchange network [1] with processor address 
masking [2] of signal reception. But a straight- 
forward implementation would require P bi- 
directional busses of width W, where P is the 
number of processors and W is the word length. 
For 256 processors of 32-bit word length this 
amounts to 8192 bidirectional lines whose ter- 
minations alone would dissipate nearly 2 kilowatts 
at TTL signal levels. Furthermore, if the network 
is to interconnect an array of (bit-slice) microprocessors,
its hardware requirements may become a
significant portion of the total array hardware.
These arguments urge consideration of series- 
parallel approaches to network implementation and 
their optimization. Smith and Siegel [3] treat 
this question in considerable depth for three net- 
work types including the shuffle-exchange network. 
However, their treatment uses standard logic, 
whereas a realistic evaluation of both hardware 
and speed must be based on LSI implementation. 
This paper presents and compares a number of 
implementation options, both recirculating and 
pipeline, each based on a single repetitive LSI 
chip. They include network microprogramming hard- 
ware and consider the processor interface and net- 
work timing. 


The recirculating network uses a 2 processor-line
× 8 bit chip which contains 153 gates and
requires 38 pins. This chip is a universal
building block independent of the number of 
processors or the word-length (full utilization 
obtained for a multiple of bytes). Operation is 
speeded up by local recirculation through an 
internal register. Series-parallel handling of 
the word (down to a single byte) with proportional 
reduction in hardware and increase in time is 
possible without any change in the chip or any 
external hardware addition. Exchange control is 
supplied by a 256 word rewritable memory fast 
enough to keep pace with the recirculation rate. 


The pipeline network uses a 4 processor-line 
chip which contains 197 gates and requires 54 
pins. It is independent of word length but does 
vary with the number of processors, hence is not 
truly universal. A bypass path is provided to 
make the pipeline fully equivalent to the recir- 
culating network. Three features are introduced 
that improve speed or economy, and sometimes both. 
Thus, two shuffle-exchanges (with optional bypass) 
are implemented in a single AND-NOR complex to 
save both hardware and time in the pipeline. 
Further, the four processor lines per chip are 


grouped so as to share all data and control lines, 
thus enhancing the packing factor (gates per pin). 
The possibility of grouping in this manner is 
shown to be an inherent property of the shuffle- 
exchange logic, independent of array size. 
Thirdly, an AND-NOR complex may be latched up 
simply by inverting its output and feeding it back 
through an additional AND-leg, yielding a parti- 
cularly fast and economical circuit. Pipelining 
may be implemented to any degree desired but soon 
reaches a point of diminishing returns. Parallel- 
serial handling of the word with proportional 
decrease in time and increase in hardware is 
possible without any change in the pipeline chip 
or any external hardware addition. 


Expressions are given for the hardware and 
time requirements as a function of array size and 
word length, and the performance achievable with 
Schottky-TTL technology is tabulated for the array 
size mentioned earlier. The implementation 
approaches described here may be extended to other 
interconnection networks except for the grouping 
property which is specific to the shuffle exchange 
pipeline and the Omega network. 
SCHEDULING PARALLEL PROCESSES WITHOUT A COMMON SCHEDULER


Abstract: An algorithm which solves the critical
section problem for distributed processes is
presented. We extend the solution of Lamport
[LL76] by continuing to allow processes to access
their respective critical sections in any
arbitrary user-specified order, but with greatly
reduced storage requirements for each process. In
addition, we supply a facility for testing the
presence of deadlock among processes waiting to
enter their critical code. We show our scheme to
be tolerant of several malfunctioning processors,
and derive an equation relating the probability of
total system failure to the probability of many
individual failures occurring simultaneously among
the processors.


INTRODUCTION


The "critical section" problem, which
involves developing a synchronization scheme for a
set of processes that enforces solo occupancy of
common code, is further complicated when we
generalize the circumstances under which the
scheme will work or restrict the allowable
solutions in some manner. For example, we will
assume that the processes execute asynchronously
(i.e. nothing is known about one process' rate of
execution relative to that of another process nor
to the same process' rate of execution at a
different time) and that each process must have
the same solution as every other process. Another
reasonable objective is to avoid possible deadlock
resulting from two or more processes waiting for
each other.


A number of solutions to the critical section
problem have been developed and studied since
Dijkstra's initial paper [EWD65]. The results
reported in that paper, along with the subsequent
refinements outlined by Knuth [DEK], deBruijn
[deB], and Eisenberg and McGuire [EM], assumed
that concurrent processes would be implemented on
multiprogrammed systems. These systems allow
different processes to read from or write into any
memory location.


Only recently have researchers begun to look
at multiprocessor or distributed systems. In such
a system, a process may read or write in its local
memory and may read from another processor's
memory, but may not write into another processor's
address space. This restriction prevents the use
of global variables, but does yield one important
advantage over multiprogrammed computers: if one
process fails, the entire system does not
necessarily crash, though system performance will
likely be degraded.


One of the first examinations of distributed 
systems was done by Dijkstra [EWD74]. This paper 
studied the possibility of processors 
independently recognizing that they had failed and 
correcting themselves to some prescribed state. 
At about the same time Lamport [LL74] presented a 
solution to Dijkstra’s original problem with 
critical sections that obeyed the constraints of 
distributed computers. Rivest and Pratt [RP]
improved upon this scheme by bounding the values
of the variables necessary for inter-process
coordination and by preventing a process that
continually fails and restarts from deadlocking
the system. Further improvements (in terms of
smaller ranges of values for variables, greater
fairness when sequencing processes for entry into
their critical regions, and reduced waiting times
for processes before entering their respective
regions) were developed by Peterson and Fischer
[PM]. Finally, Katseff [HPK] incorporated the
best aspects of each of these solutions, including 
the servicing of processes in the order in which 
they arrive (FIFO), into one algorithm. 


Taking a somewhat different approach, Lamport
[LL76] recognized the fact that it is not always
desirable to allow processes to enter their
critical regions in the same order in which they
attempted to access these regions. It is
frequently the case that a process may not
conflict with another process in the sense that
they may enter critical regions simultaneously,
though both these processes may conflict with a
third process. Furthermore, given a set of
processes that are currently prevented from
entering their critical regions, we may wish to
impose some priority on these processes so that
when conflicting processes eventually do leave
their critical regions, the process having the
highest priority, rather than the process that has
been waiting the longest, will be the first to
access its own region.


In this paper we present a modification of
Lamport's system that corrects some drawbacks of
both his and Katseff's solutions. In particular:

1) We maintain the basic capabilities of
Lamport's design but add a facility to detect the
formation of anomalous situations in which a set
of processes will deadlock because each process
believes another process has priority over it.

2) One variable that is used in Lamport's
solution may grow unboundedly large (though in
practice this may have little effect). We show
how to limit to a finite range the possible values
of all variables used for synchronization
purposes.

3) Lamport's and Katseff's code requires that
each process contain an array, the length of which
is equal to the total number of processes. With
the recent advances in computer-on-a-chip hardware
designs, it is quite likely that future machine
architectures will involve huge numbers of
communicating processors (capable of running a
proportionately large number of processes), each
processor possessing a fairly limited amount of
memory. Such a hardware scheme is clearly
incompatible with Lamport's and Katseff's
routines. In our program, each process will need
to keep track of only a constant number of other
processes.


SYSTEM OVERVIEW


The architecture of the system we will use
for our studies is conceptually simple: we have a
set of processors, each processor capable of
executing at most one process from a set of N
processes, and each processor communicating with a
subset of the other processors. By "communicate"
we mean that one processor may read from another's
memory or possibly transmit an interrupt signal
(this latter condition is not essential);
however, conforming to the definition of a truly
distributed system, it may not store into any
memory but its own.


We further assume that a processor may fail,
though it does so in a somewhat orderly fashion.
A read request issued to a process immediately
after this process has malfunctioned may return
arbitrary values. Eventually only some default
value will be returned by read requests to a
failing processor, hence it is impossible to
accurately examine the memory contents of such a
processor. Each processor has the ability to
detect its own deviation from normal operating
protocol and shut itself down without transmitting
spurious interrupts and without writing incorrect
information on a disk to which it is linked. The
process that had been running on a processor until
that processor malfunctioned may be restarted at
some predefined point.

As noted in the previous section, the early
solutions to the critical section problem require
disjoint processes to store into common memory
locations. Many of the synchronization schemes
that have been proposed to date, such as PV
[EWD68], monitors [CARH74], and path expressions
when implemented in terms of semaphores [CH, ANH],
seem to rely upon a dedicated scheduling routine.
Unfortunately, such schemes are incompatible with
the desired autonomy of processors. For if the
processor in which global data is stored or a
dedicated scheduler should fail, the entire system
fails. Lamport has explored many aspects of a
synchronization scheme that avoids this drawback,
though he only touches briefly upon the issue of
scheduling. We examine this last issue in greater
detail.


The synchronization primitive used by Lamport
is an extension of the conditional critical region
first proposed by Hoare [CARH71] and later
described by Brinch-Hansen [PBH72, PBH73a,
PBH73b]. This new primitive takes the form

    region <mode> when <condition>
        do <critical-section> od

The metavariable <mode> is an expression
(typically a constant or a single variable) which
evaluates to an element of some arbitrary finite
set M (subject to Restriction #1 below);
<condition> is a Boolean expression;
<critical-section> is an arbitrary length of code
(subject to Restriction #3) which comprises the
critical region.


It may not be the case that all critical
regions will conflict with all other critical
regions in the sense that we may desire two
processes to be executing their critical regions
simultaneously, though either or both of these
processes may in turn prevent a third process from
entering its region. To formalize this notion, we
define a symmetric, time-independent function
conflict: M x M --> {true, false}. We then say
that two processes conflict if and only if they
are both attempting to execute region statements
with respective <mode> values of mode1 and mode2,
and conflict(mode1, mode2) = true.
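
As an illustration of such a function, the sketch below defines conflict over a hypothetical two-element mode set. The modes "read" and "write" are an assumption made for this example (readers may overlap, writers exclude everyone) and are not taken from the paper.

```python
from itertools import product

# Hypothetical mode set M, chosen only for illustration.
MODES = ("read", "write")

# Unordered pairs of modes that conflict. Storing frozensets makes the
# function symmetric by construction, and the table never changes at run
# time, so the function is time-independent.
_CONFLICTING = {frozenset(("read", "write")), frozenset(("write",))}


def conflict(mode1, mode2):
    """True if two region statements with these modes may not overlap."""
    return frozenset((mode1, mode2)) in _CONFLICTING


def is_symmetric(modes=MODES):
    """Check conflict(a, b) == conflict(b, a) over every pair of modes."""
    return all(conflict(a, b) == conflict(b, a) for a, b in product(modes, modes))
```

Two processes in "read" mode do not conflict with each other here, yet both conflict with a third process in "write" mode, which is the situation the paragraph above describes.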


The semantics of the region statement can be
stated quite simply: the code in the
<critical-section> may not begin execution if a
conflicting process has already entered the
<critical-section> of a region statement or if
<condition> evaluates to false. To prevent
certain anomalous situations from arising, we must
enforce the following restrictions on our
synchronization primitive:


Restriction #1: The value of <mode> must remain 
constant during the entire execution of the 


region statement to which it is associated. 


Restriction #2: To prevent races between 
instructions which alter and examine a when 
<condition>, arguments of the <condition> of 
one process’ region statement which are 
stored in the memory of another process may 
only be modified by this second process 
within a region statement which conflicts 
with the first region statement. 


Note that if the <condition> of a region statement
does not depend upon the contents of another 
process’ address space, then this <condition> must 
always evaluate to true, for if this were not so, 
then the process would enter the region statement, 
halt execution until the <condition> became true, 
thereby preventing assignments to the very 
variables that can satisfy the <condition> and 
causing the process to deadlock with itself. 


Restriction #3: A region statement may not be
     one of the instructions in the
     <critical-section> of another region
     statement.


One problem frequently associated with
conditional critical regions is the difficulty
they pose in expressing some synchronization
problems. These problems usually have a
"scheduling" flavor to them: given a set of
conflicting processes that are all competing to
enter their respective critical regions, which
will take precedence? To remedy this flaw, we
define a new function must_precede: {1, 2, ..., N}
x {1, 2, ..., N} --> {true, false} which may
depend upon any information that is available to
the system. Therefore given a particular i and j
in the set {1, ..., N}, must_precede(i, j) need
not remain constant over a period of time.
(Lamport actually calls this function
"should_precede"; we will save this term to
denote a different function.)


This very general definition of must_precede
is actually too permissive. The following
argument illustrates this point. Suppose that in
addition to i and j, the names of the two
processes, the function must_precede depends on K
other sources of information, e.g. which
processes are in their critical regions, which
processes are awaiting permission to enter their
regions and how long they've been in this state,
which processes have failed, the values stored in
the memories of various processes, etc. It is
very unlikely that a process can examine all K+2
arguments and instantly determine the value of
must_precede(i, j). Rather, the process would
probably scan one or two arguments at a time and
combine this information with previously computed
results to obtain a partial answer. This
procedure would repeat until all arguments
have been examined and must_precede(i, j) has
been fully determined. Consider the case in which
a process is scanning the xth argument of
must_precede (x is in the interval [2, ..., K+2])
when another process alters the value of the yth
argument (y is in the interval [1, ..., x-1]).
The first process will never rescan the yth
argument, so the value it finally obtains for
must_precede(i, j) will be incorrect. To
overcome this difficulty, Lamport assumes that
must_precede is strongly constant, meaning that
its value will not change when we are in the midst
of computing it. This convention simplifies
matters greatly (and in fact probably does not
pose a severe restriction), so we will adopt it as
well.


The interpretation of the must_precede
function is self-evident, but it is important to
point out that it has meaning only on those
processes that are simultaneously waiting to enter
their critical regions and that conflict with one
another. Putting together the mechanisms we have
described so far, it becomes clear that a process
i can enter the <critical-section> of a region
statement only if the following three conditions
are satisfied:


Condition #1: All processes that conflict with
     process i are executing code outside of
     their critical regions.


Condition #2: The when <condition> evaluates to
     true.

Condition #3: For all processes j that are
     presently executing region statements but
     have not yet entered their
     <critical-section>s, and that conflict with
     process i,

                              true   if j has been
                                     waiting longer than i
       must_precede(i, j) =
                              false  if i has been
                                     waiting longer than j


In other words, of all the processes that do
not conflict with another process that is in a
<critical-section> (#1), that have true when
<condition>s (#2), and that have no predecessors
(in the sense that there is no conflicting process
j for which must_precede(j, i) holds true), time
of arrival is the final arbiter (#3). We impose
one more condition on our system that guarantees
that no process can be locked out of a
<critical-section> once it has begun executing a
region statement:


Condition #4: Assuming no further processes
     encounter region statements, a process
     satisfying Conditions 1 - 3 will enter its
     <critical-section> after a finite delay.

This condition will follow if we assume that all
processes make progress executing their
instructions (though our previous assumption of
asynchronous operation may make this progress very
slow) and if a permanent deadlock situation does
not exist among the processes that are waiting to
enter their critical regions.
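
Conditions #1 to #3 can be sketched as a single admission test. This is only one reading of the conditions, following the "no predecessors, then time of arrival" summary: a waiting process j blocks i if must_precede(j, i) holds, or if neither process has priority and j arrived first. The `Proc` record, its field names, and the centralized view of all processes are assumptions made for illustration. The sketch is not the distributed algorithm developed in the paper, and it ignores the degenerate case of a process that conflicts with itself.

```python
from dataclasses import dataclass


@dataclass
class Proc:
    pid: int
    mode: str
    state: str              # "waiting", "critical", or "idle"
    arrival: int = 0        # smaller value means waiting longer
    condition: bool = True  # current value of the when <condition>


def can_enter(i, procs, conflict, must_precede):
    """Return True if process i satisfies Conditions #1 to #3 (one reading)."""
    if not i.condition:                          # Condition #2
        return False
    for j in procs:
        if j.pid == i.pid or j.state == "idle":
            continue
        if not conflict(i.mode, j.mode):
            continue
        if j.state == "critical":                # Condition #1
            return False
        # j is waiting and conflicts with i: Condition #3
        if must_precede(j.pid, i.pid):           # j has priority over i
            return False
        if not must_precede(i.pid, j.pid) and j.arrival < i.arrival:
            return False                         # no priority: j arrived first
    return True
```

With no priorities set, the process that has waited longest is admitted first; setting must_precede for a later arrival lets it overtake an earlier one.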


THE ALGORITHM 


In the last section we briefly mentioned the
possibility of two or more processes causing a
deadlock while waiting to enter critical regions.
To see how this might happen, consider the most
trivial case for the moment. Suppose that process
i has just encountered the statement


     region mode1 when true do <anything> od


where conflict(mode1, mode1) = true and
must_precede(i, i) = true. Using our rules for
selecting processes to enter their
<critical-section>s, process i must wait for
itself to leave its <critical-section> before it
can enter it, a clear impossibility. A deadlock
is present, and Condition #4 is violated (unless
must_precede(i, i) changes to false at some
future point). Although this may seem like a
contrived example, and therefore not a very
convincing justification for our attempts to
determine the existence of deadlocks, these
deadlocks can arise in far more subtle ways. The
following theorem characterizes the situations in
which a deadlock will be present.


Cycle Theorem: A deadlock will exist among the
     processes that are awaiting entrance to their
     critical regions if and only if there exists
     a subset {P(0), P(1), ..., P(L)} of these
     processes which form a "cycle" in the sense
     that for all i in the set {0, 1, ..., L}

     (1) P(i) is in a region statement with <mode>
         value M(i), and

     (2) the functions must_precede(P(i), P(i+1 mod
         L)) and conflict(M(i), M(i+1 mod L))
         evaluate to true.


Proof: The "if" part follows immediately from our
definitions. The "only if" part stems from the
following fact: if we trace backwards over the
must_precede and conflict relations on a finite
set of processes, we must eventually either return
to a process which has already been visited
(thereby showing the presence of a cycle), or else
we will arrive at a process i for which there are
no processes j such that must_precede(j, i) =
true and processes i and j are in conflicting
region statements. In this latter case there is
no cycle, but process i can enter its
<critical-section> and there is no deadlock.
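
The Cycle Theorem suggests a direct test for deadlock: treat "must_precede(i, j) and conflict(M(i), M(j))" as a directed edge from i to j among the waiting processes and search for a cycle. The sketch below does this with a depth-first search over a centralized snapshot. That snapshot, and the name `find_cycle`, are assumptions made for illustration; the paper's own detection facility is distributed and is not reproduced here.

```python
def find_cycle(waiting, conflict, must_precede):
    """Return a list of process ids forming a cycle, or None.

    waiting maps each waiting process id to the <mode> of its region statement.
    """
    def successors(i):
        return [j for j in waiting
                if must_precede(i, j) and conflict(waiting[i], waiting[j])]

    WHITE, GREY, BLACK = 0, 1, 2      # unvisited, on current path, finished
    colour = {p: WHITE for p in waiting}
    path = []

    def visit(i):
        colour[i] = GREY
        path.append(i)
        for j in successors(i):
            if colour[j] == GREY:     # reached a process already on the path
                return path[path.index(j):]
            if colour[j] == WHITE:
                found = visit(j)
                if found:
                    return found
        path.pop()
        colour[i] = BLACK
        return None

    for p in waiting:
        if colour[p] == WHITE:
            found = visit(p)
            if found:
                return found
    return None
```

The trivial deadlock described earlier, a single process with conflict(mode1, mode1) and must_precede(i, i) both true, appears here as a cycle of length one.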

We must establish several ground rules for
manipulating faulty processes so that we will have
a common convention with which to work. In
addition to assuming that a failing process does
not behave "maliciously," e.g. by sending off
spurious interrupts to the remaining operational
processes, we further assume that we have some
reliable mechanism for determining whether a
particular process has failed. A process can be
thought of as emitting a "carrier signal"; when
the signal dies, the process has failed.


Processes which fail while on the queue
remain there until some external device repairs
them so that they can eventually enter their
critical sections. We adopt this convention on
the basis of its being the most general scheme for
dealing with the failure of enqueued processes.
"Most general," in the sense used here, means the
ability of this scheme to simulate any other
scheme. This generality arises from the
flexibility of scheduling provided by the
must_precede function. For example, we could easily alter
the value of must_precede to effectively ignore
the presence of a failed process on the queue. Of
course, we are assuming that in such a situation
the values of the arguments to must_precede can be
determined despite the loss of accessibility to
data that has been stored by malfunctioning
processes.

Processes which fail while executing their
critical sections can block many other processes
with which they conflict, thereby causing serious
degradation in system performance. We will assume
such processes are to be removed from their
critical sections by the external mechanism before
being repaired and returned to normal operation.
Note that once in its critical section, a process
is beyond the effects of the must_precede
function. Thus we do not have the run-time
flexibility we had when dealing with the failure
of enqueued processes, and we appear to be quite
rigidly bound by whatever scheme we choose for
servicing processes which fail in the midst of
their critical code.


It would be unreasonable to assume that a 
process can be made to stop, perform some desired 
operation, and resume unless it is under our 
control. Thus we cannot expect the cooperation of 
processes which are executing their critical 
sections or non-critical sections. The only times 
a process does come under our control so that it 
can be made to perform synchronization tasks is 


when it is waiting on the queue and when it is 
leaving its critical section. 

Because concurrent computations are 
inherently difficult to understand (and rigorous 
mathematical proofs of their correctness are even 
more difficult to comprehend), we will break down 
the development of the algorithm into three steps. 


In the first version, we deal with a sequential 
program that will temporarily serve as our 
scheduler and that is easy to comprehend. In the 
next version, we transform the sequential program 
into a parallel program. At this point we are 
halfway to our target program: control of 
instruction sequencing has been removed from the 
central scheduler and is now managed by the 


individual processes, but shared memory is still 
utilized. In the final version, we convert this 
parallel program into fully distributed code by 
passing out the common storage locations among the 
component processes. (For notational convenience, 


we say i ==> j if conflict(i,j) = true and 
must precede(i,j) = true.) 
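
The ==> notation can be made concrete with a small sketch. The 
function name arrow_relation and the sample predicates below are 
illustrative assumptions, not part of the original system; the 
sketch only shows how the pairs (i, j) with i ==> j follow from 
the two user-supplied predicates. 

```python
def arrow_relation(processes, conflict, must_precede):
    """Return the set of ordered pairs (i, j) such that i ==> j.

    `conflict` and `must_precede` are user-supplied predicates on
    pairs of process identifiers (hypothetical signatures).
    """
    procs = list(processes)
    return {
        (i, j)
        for i in procs
        for j in procs
        if conflict(i, j) and must_precede(i, j)
    }


# Hypothetical example: neighbouring processes conflict, and the
# lower-numbered process must precede the higher-numbered one.
def sample_conflict(i, j):
    return abs(i - j) == 1


def sample_must_precede(i, j):
    return i < j
```

With these sample predicates on processes 0, 1 and 2, the relation 
contains exactly 0 ==> 1 and 1 ==> 2. 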

There are several advantages to treating the 
development of a distributed program as code 
synthesis beginning with a simple statement of the 
solution rather than as a programming task 
followed by a verification phase. Not only are 
proofs of programs (especially parallel programs) 
difficult to devise, they are almost as difficult 
to understand due to their ad hoc nature. Even if 
the rules of verification could be formalized, 
mechanical verifiers invariably suffer from 
extremely poor efficiency, as the task they are 
meant to perform is almost surely intractable. 


Synthesizing code by means of simple 
transformations need not require a major effort, 
just as the compilation of high level sequential 
languages into machine level code can be 
accomplished efficiently and in a straightforward 
manner (presumably because this is a well 
understood task). Furthermore, programming 
techniques demanding verification suffer because 
it is difficult to build each new program upon old 
programs. Instead, many papers dealing with 
parallel processes seem to begin afresh, defining 
low level features, expanding upon them, and 
finally verifying what has been developed. On the 
other hand, synthesis begins with a small 
collection of requisite parameters, and modifies 
these to mesh with the low level features of the 
system in a top-down fashion. 


Version 1 


In this initial version, we are dealing with 


a very simple sequential program. The scheduler 
exists as a separate routine (which we will 
presently assume is immune to failure), and 


governs the operation of all other processes. A 
macroscopic view of the operation of the scheduler 
is given by the flowchart in Figure 1. 


There is one very important issue that we 
have avoided so far: how do we deal with two or 
more processes that simultaneously begin execution 
of region statements? Or in terms of our system, 
how do we treat processes that signal their 
intention to interact with the scheduler when the 
scheduler is already busy servicing some other 
process? Before proceeding with our description 
of the algorithm, we must put this issue to rest 
by establishing a method for determining the 
relative ordering of such processes. 


Optimally, we would like the scheduling 
routine to service processes in the same 
chronological order in which these processes 
signal the scheduler. One solution to this 
problem, performed at the implementation level of 
the system, would be to let each process dispatch 
an interrupt when it wants the attention of the 
scheduler. The scheduler, in turn, serves as our 
interrupt handler, and it disables all other 
interrupts until the process requesting attention 
is fully serviced. In this solution, we have 
pushed the problem back onto the hardware 
mechanism. 


Another possible solution might be to let 
each process maintain a timer while it is awaiting 
the attention of the scheduler. The timer could 
be a mechanical clock, or we could let the program 
idle in a loop. On each iteration of the loop, a 
variable TIMER would be incremented by one. When 
the scheduler becomes available, it picks the 
process whose timer indicates the longest wait. 
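
A minimal sketch of the timer scheme follows. The names tick and 
pick_longest_wait are hypothetical, and the idle loop is simulated 
by explicit calls to tick; this is an illustration under those 
assumptions rather than the mechanism of any particular system. 

```python
def tick(timers, waiting):
    """One iteration of every waiting process's idle loop:
    each waiting process increments its own TIMER by one."""
    for pid in waiting:
        timers[pid] += 1


def pick_longest_wait(timers, waiting):
    """Return the waiting process whose timer shows the longest wait,
    or None if nobody is waiting. Ties go to the process listed first,
    which is an arbitrary choice made only for this sketch."""
    if not waiting:
        return None
    return max(waiting, key=lambda pid: timers[pid])


# Process 1 starts waiting one tick before process 2.
timers = {1: 0, 2: 0}
waiting = [1]
tick(timers, waiting)
waiting.append(2)
tick(timers, waiting)
chosen = pick_longest_wait(timers, waiting)
```

Here process 1 is chosen, since its timer has advanced twice and 
that of process 2 only once. 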


This solution suffers several drawbacks. 
Depending upon the response time of the scheduler, 
the value stored in the timer could grow 
unboundedly large. Even worse, we are dealing 
with an asynchronous system, so the timer may not 
reflect a true measure of the waiting time (though 
if we assume a finite bound on the speed of one 
process relative to another, we are guaranteed 
that all processes will eventually command the 
attention of the scheduler). 


In both of these solutions, we have relied 
upon an external agent to assume the burden of the 
problem. Is it possible to avoid the use of an 
external device entirely? We maintain the answer 
is no. In any realistic system, there will be a 
lower bound on the length of time that can be 


measured. If two events occur within this time 
span, we are faced with the problem of taking 
these seemingly simultaneous events and 


determining which of them actually came first. 
What choice do we have, but to rely upon an 
external arbiter to resolve this dilemma? 
Hopefully, such an arbiter would either be capable 
of measuring time on a more refined basis, or 
would have some other information, unknown to us, 


for ordering events. 


In our system, the lower bound for measuring 
time is the maximum response time of the 
scheduler. What we have done, in effect, is to 
treat time as a resource, and to insist that 
mutual exclusion be maintained on this resource at 
those points in time when a process is interacting 
with the scheduler. We note in passing that many 
systems that have been described in the literature 
finesse the issue of simultaneity by assuming the 
availability of indivisible or atomic operations. 


Version 2 


Continuing with the synthesis of our final 
program, we now "snip" the control mechanism to 
eliminate the explicit scheduler. The scheduler, 
which is still failure-free, can instead be 
thought of as existing only in a conceptual form, 
transmitting instructions to the individual 
processes. By this we mean that the scheduler 
issues an instruction which all the processes 
compete for and execute. The execution of such an 
instruction is finished when all the processes have 
completed their portions of the code, or have 
failed. The result is a parallel program which 
utilizes shared memory. 


In reality, each process will have a copy of 
the scheduler. These individual copies will 
operate in an asynchronous parallel manner by using 
a "mutual handshake" concept. When one component 
of the scheduler finishes some instruction, it 
polls the other components to determine if they 
have finished their respective instructions, and 
waits until they have done so before proceeding 
with the next instruction. Setting a flag at the 
beginning and end of each instruction would be a 
simple mechanism for determining whether or not 
each process had finished its scheduling 
instruction. Figure 2 illustrates a sample 
instruction for enqueuing a process that begins 
executing a region statement. 


We also begin to decompose the queue at this 
point. Instead of having one process, the common 
scheduler, store the configuration of the enqueued 
processes, we now let each component process 
remember its location within the queue. The 
processes on the queue will be strung together in 
a linear sequence by a set of multiple pointers. 
Each process contains s-element arrays BEFORE and 
AFTER. The value of BEFORE[i] is the identifier 
of the process which arrived on the queue i 
arrivals before the process in which this array is 
stored. AFTER has the complementary meaning. We 
will sometimes subscript a variable with index i 
to emphasize that this variable is local to 
process i. Figure 3 provides a global view of the 
structure of these arrays. 
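
The layout of these arrays can be sketched from a global point of 
view. The helper build_links below is hypothetical: it derives 
every process's BEFORE and AFTER arrays from a known arrival 
order, which no single process would hold in the distributed 
setting. None marks a slot with no corresponding process, a 
convention assumed only for this sketch. 

```python
def build_links(queue, s):
    """Given the arrival order `queue` (head first) and the array
    length s, return each process's local BEFORE and AFTER arrays.

    List index d-1 holds the paper's BEFORE[d] / AFTER[d]: the process
    that arrived d arrivals before / after the owner of the array.
    """
    links = {}
    for pos, pid in enumerate(queue):
        before = [queue[pos - d] if pos - d >= 0 else None
                  for d in range(1, s + 1)]
        after = [queue[pos + d] if pos + d < len(queue) else None
                 for d in range(1, s + 1)]
        links[pid] = {"BEFORE": before, "AFTER": after}
    return links


links = build_links(["a", "b", "c", "d"], s=2)
```

For process c in this example, BEFORE is [b, a] and AFTER is 
[d, None]: two links reach backward past b, so the failure of b 
alone does not cut c off from a. 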


The purpose of the multiple links between 
processes is twofold. First, should a process 
fail, we can still determine which processes 
follow it or precede it on the queue simply by 
following an alternative link around the 
malfunctioning process. And second, the 
redundancy of these pointers can be useful for 
detecting the failure of processes. Many previous 
solutions to the critical section problem assume 
that when a process fails, it turns on some sort 
of signal that beacons its failure to the 
remaining functional processes, so that the 
operation of these processes will not be affected. 
Clearly this is not an entirely realistic 
assumption. We note that if BEFORE_i[j] = k and 
AFTER_k[j] does not equal i, it is quite 
likely that either or both of processes i and k 
have failed. Further tests involving comparisons 
with links from other processes could aid in 
pinpointing the exact identity of the 
malfunctioning process. 
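
The link comparison just described can be sketched as follows. 
The function suspicious_pairs and the dictionary layout are 
assumptions made for illustration; the check itself is the one 
stated above, namely that BEFORE_i[j] = k should be mirrored by 
AFTER_k[j] = i. 

```python
def suspicious_pairs(links, s):
    """Return the pairs (i, k) for which BEFORE_i[j] = k but AFTER_k[j]
    is not i for some j in 1..s. A process whose arrays cannot be read
    at all is treated as inconsistent with everyone pointing at it."""
    suspects = []
    for i, arrays in links.items():
        for j in range(1, s + 1):
            k = arrays["BEFORE"][j - 1]
            if k is None:
                continue  # nothing arrived j places before i
            after_k = links.get(k, {}).get("AFTER")
            if after_k is None or after_k[j - 1] != i:
                suspects.append((i, k))
    return suspects


# Queue a, b, c with s = 1; the second table has lost b's forward link.
healthy = {
    "a": {"BEFORE": [None], "AFTER": ["b"]},
    "b": {"BEFORE": ["a"], "AFTER": ["c"]},
    "c": {"BEFORE": ["b"], "AFTER": [None]},
}
damaged = {
    "a": {"BEFORE": [None], "AFTER": ["b"]},
    "b": {"BEFORE": ["a"], "AFTER": [None]},
    "c": {"BEFORE": ["b"], "AFTER": [None]},
}
```

A reported pair only says that one or both of its members are 
suspect; as noted above, further comparisons would be needed to 
decide which. 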


Version 3 


In the third and final version of our 
routines, we are ready to eliminate the scheduler 
completely and to distribute both the memory and 
control mechanism to the individual processes. 
Each process has a copy of the scheduler and can 
be thought of as issuing instructions to itself. 
The processes then operate in conjunction to 
determine which instructions should be executed 
and when. 


Possibly the first feature of version 2 that 
strikes the reader is that memory management has 
been almost entirely divided among the constituent 
processes. This division of memory management has 
been one of our prime objectives from the 
beginning, for in order to conform to the 
definition of a distributed system and reap the 
fault-tolerant capabilities such systems have to 
offer, we must ensure that individual processes 
perform write operations only on their own local 
memories. An examination of the instructions of 
version 2 reveals that all of the instructions 
cause process i to alter only the contents of its 
own memory. 

We are not quite finished, however, due to the 
memory requirements that would result from a 
naive implementation of the instructions. A 
restriction we have placed on our system, along 
with the need for a distributed control mechanism, 
is that each process use a limited amount of 
memory. In other words, each process should have 
an address space whose size is independent of n, 
the number of processes. Nearly all of the 
instructions obey this property, the sole 
exception being the deadlock-test operation. 


The Cycle Theorem tells us that testing for 
deadlock is equivalent to testing for the presence 
of cycles in the ==> relation. Phrasing this 
another way, a deadlock will exist if and only if 
some process p obeys the relationship p ==>+ p, 
where ==>+ is the (non-reflexive) transitive 
closure of ==>. A deterministic algorithm for 
computing the transitive closure on n objects will 
undoubtedly proceed by following the ==> relation 
from one object to the next and backtracking where 
necessary. To prevent some sequence a ==> b ==> c 
==> ... ==> z of processes from being examined 
repeatedly, it appears necessary to keep a record 
of the processes along such chains that have 
already been scanned and need not be re-examined. 
The number of markers needed to maintain this 
record yields O(n) space complexity in the worst 
case. Linear space complexity is unfortunate from 
our point of view, for even though an amount of 
memory proportional to n will be needed to test 
for deadlock, no single process can directly 
utilize that much space. Thus each of the n 
processes must devote a constant amount of memory 
toward executing the deadlock-test instruction. 
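
The centralized search with markers can be sketched as a 
depth-first traversal. The function reaches_itself and the 
successor table are assumptions made for illustration; the 
visited set plays the role of the markers and is what gives the 
O(n) space bound in the worst case. 

```python
def reaches_itself(p, successors):
    """Return True if p ==>+ p, i.e. p lies on a cycle of ==>.

    `successors[q]` lists every r with q ==> r (hypothetical layout).
    `visited` is the record of already-scanned processes; it can grow
    to hold all n processes.
    """
    visited = set()
    stack = list(successors.get(p, ()))
    while stack:
        q = stack.pop()
        if q == p:
            return True
        if q in visited:
            continue  # this chain was already examined
        visited.add(q)
        stack.extend(successors.get(q, ()))
    return False


# a ==> b ==> c ==> a is a cycle; d leads into it but is not on it.
graph = {"a": ["b"], "b": ["c"], "c": ["a"], "d": ["a"]}
```

In this example a lies on a cycle while d does not, even though 
every chain starting at d eventually circulates through a. 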


To see if process i has caused a deadlock, 
process i turns a flag CYCLE_i to ON. Each process 
k other than i checks to see if there is a process 
j such that CYCLE_j = ON and j ==> k. If so, 
process k sets CYCLE_k to ON, establishing one more 
link in the potential deadlock cycle. Eventually, 
either no more processes can set their values of 
CYCLE to ON (in which case there can be no 
deadlock) and the test ends, or some process k 
such that k ==> i sets CYCLE_k to ON, and process i 
notes the completed cycle and announces the 
presence of a deadlock. This deadlock check 
algorithm is outlined by the flowchart in Figure 
4. 
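
The flag-passing test can be simulated sequentially. In the sketch 
below the name deadlock_test, the round-by-round sweep over the 
processes, and the edge-set representation of ==> are all 
assumptions; in the distributed setting each process would examine 
its neighbours' flags on its own schedule. The point illustrated 
is that each process contributes a single flag. 

```python
def deadlock_test(i, processes, arrow):
    """Simulate the CYCLE-flag test started by process i.

    `arrow(j, k)` is True when j ==> k. Returns True if some process k
    with k ==> i turns its flag ON, which closes a cycle through i.
    """
    cycle = {p: False for p in processes}  # one flag per process
    cycle[i] = True
    changed = True
    while changed:
        changed = False
        for k in processes:
            if k == i or cycle[k]:
                continue
            # k looks for some j with CYCLE_j = ON and j ==> k.
            if any(cycle[j] and arrow(j, k) for j in processes):
                cycle[k] = True
                changed = True
                if arrow(k, i):
                    return True  # process i notes the completed cycle
    return False  # no further flag can be set: no deadlock through i


def make_arrow(edges):
    """Build an `arrow` predicate from a set of (j, k) pairs."""
    return lambda j, k: (j, k) in edges


ring = make_arrow({(0, 1), (1, 2), (2, 0)})
chain = make_arrow({(0, 1), (1, 2)})
```

With the ring relation the test started by process 0 reports a 
deadlock; with the chain relation the flags of processes 1 and 2 
are set, no further flag can change, and the test ends. 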


An analysis of the requirements for the 
deadlock-test instruction shows that it can fail 
in either of two circumstances: more than the 
designated number of consecutive processes fail 
simultaneously (in which case the remaining 
operational processes will not be able to assume 
responsibility for all of their malfunctioning 
counterparts), or all the processes on the queue 
fail simultaneously. However, neither of these 
conditions is too important. We have ruled out 
the first case (or at least know the probability 
of its happening). In the second case, there are 
no operational processes on the queue, so that 
none could possibly enter their critical sections, 
and the existence of a deadlock is therefore 
inconsequential. 


Note that we have been very liberal in 
allowing the user to risk potential deadlock 
situations. As a result, our deadlock detection 
routine incurs a great deal of run-time expense in 
the form of process cross-talk. One possible 
alternative to the scheme presented here is 
somewhat more conservative in nature. Instead of 
permitting the possibility of deadlock at 
compile-time and checking for its presence at 
run-time, we disallow definitions of must precede 
that would allow a deadlock to develop when 
certain combinations of processes are enqueued. 
This compile-time check is simple: we assume all 
processes are on the queue, and use our deadlock 
tester to see if a cycle is present. If no cycle 
exists under these circumstances, no cycle can 
ever exist, and the system is guaranteed to be 
deadlock-free. Otherwise, the user is informed 
that deadlock may develop in the future. Thus we 
need to test for deadlock only when must precede 
changes, and not whenever a new process enters the 
queue. 
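
The conservative compile-time check can be sketched as follows. 
As a stand-in for the deadlock tester, this sketch uses a standard 
topological-ordering test for acyclicity; the function name 
statically_deadlock_free and the predicate signatures are 
assumptions made for illustration. 

```python
def statically_deadlock_free(processes, conflict, must_precede):
    """Assume every process is enqueued and report whether the ==>
    relation (conflict AND must_precede) is free of cycles."""
    procs = list(processes)
    succ = {p: [q for q in procs if conflict(p, q) and must_precede(p, q)]
            for p in procs}
    indegree = {p: 0 for p in procs}
    for p in procs:
        for q in succ[p]:
            indegree[q] += 1
    # Repeatedly remove processes that nothing must precede.
    ready = [p for p in procs if indegree[p] == 0]
    removed = 0
    while ready:
        p = ready.pop()
        removed += 1
        for q in succ[p]:
            indegree[q] -= 1
            if indegree[q] == 0:
                ready.append(q)
    # Any process left over lies on, or behind, a cycle.
    return removed == len(procs)


def all_conflict(i, j):
    return i != j


def by_number(i, j):       # lower number first: no cycle possible
    return i < j


def round_robin(i, j):     # 0 before 1 before 2 before 0: a cycle
    return (i + 1) % 3 == j
```

A must precede definition such as by_number passes the check once 
and for all, whereas round_robin would be rejected before any 
process is ever enqueued. 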


FAILURE ANALYSIS 


One drawback of our system is that under 
extreme circumstances the entire system may fail. 
Such a situation would arise if groups of 
operational enqueued processes were separated by 
so many failed processes that the former could not 
use the information contained in BEFORE and AFTER 
to derive the relative ordering of the groups. If 
each of these arrays has s elements, at least 2s 
consecutive processes on the queue must be down at 
the same time for the system to collapse. The 
probability of such a failure occurring is given 
by the formula 
n-s-1 |i/s+1| i-js a 
1+. j [(p-l)p J 
i=0 j=l 
Ss n-i-s-1 
e(i=p)ep “Clap ) 
where p is the probability that an individual 
process will be nonoperational at any particular 
moment. By making s as large as we desire, this 
probability becomes arbitrarily small. 
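
The dependence of this probability on s can also be explored 
numerically. The sketch below does not evaluate the closed 
formula; it is a direct dynamic-programming computation under the 
assumptions that the n processes fail independently, each with 
probability p, and that the queue is a simple linear sequence. 
The function name prob_failure_run is hypothetical. 

```python
def prob_failure_run(n, r, p):
    """Probability that, among n independent processes each down with
    probability p, some r consecutive processes are all down."""
    if r <= 0:
        return 1.0
    # state[k]: probability that no run of r failures has occurred so
    # far and the current trailing run of failures has length k (< r).
    state = [0.0] * r
    state[0] = 1.0
    for _ in range(n):
        nxt = [0.0] * r
        for run, mass in enumerate(state):
            nxt[0] += mass * (1.0 - p)      # next process is operational
            if run + 1 < r:
                nxt[run + 1] += mass * p    # run grows but stays below r
            # otherwise the run reaches r and the mass leaves the table
        state = nxt
    return 1.0 - sum(state)
```

Taking r = 2s corresponds to the collapse condition stated above; 
for fixed n and p the returned value shrinks as r grows, which is 
the behavior the text describes. 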


CONCLUSIONS 


We have demonstrated a solution to the 
critical section problem for distributed systems 
that satisfies the stated design requirements: 

1) It permits arbitrary processes to 
conflict/not conflict depending upon the particular 
critical regions they are attempting to enter. 

2) It allows granting access to critical regions 
based upon an arbitrary scheduling function. The 
order of requests for entering the critical 
regions is maintained and can be used for 
scheduling purposes. 

3) All variables involved in the synchronization 
process assume values from a finite range. 

4) All processes need to store only a small 
amount of data to maintain the synchronization 
scheme. By "small" we mean an amount that is 
independent of the number of processors in the 
system. 

5) The failure and subsequent restart of any 
individual process or even a reasonably small 
subset of processes will not cause a widespread 
system malfunction. 

6) The creation of a cycle of 
must-precede-related processes and the resulting 
deadlock can be detected, though we do not specify 
what course of action should be taken from that 
point on. 

Most importantly, we have demonstrated a 
technique for transforming an easy-to-understand 
sequential program into a distributed program. 
Each step of the transformation is reasonably 
straightforward. We have attempted to find 


natural lines along which to decompose our 
program. With a greater effort, we might hope to 
formalize the transformation process, possibly to 


the point where it could be mechanized. 


Our results point to several other areas that 
should be examined. For example, we have 
described one notion of deadlock, when in fact 
there exists another rather obvious form of 
deadlock with which we have not dealt. If a 
process is waiting on the queue for its when 
condition to turn true, but no other conflicting 


process has yet arrived which can alter this when 
condition, then this process, along with all 
enqueued conflicting processes which it 


must precede, will sit idle. Determining whether 
a process will alter any variables and thereby 
change when conditions is recursively undecidable, 
so it may not be feasible to build a mechanism to 
accurately detect or correct this type of 
deadlock. Is this an important consideration 
among real parallel routines? If so, will 
heuristic deadlock testers suffice to make this a 


negligible problem? 


Furthermore, we have been able to develop a 
reasonably simple algorithm by passing the details 
of scheduling, in the form of conflict and 
must-precede relations, to the user. While this 
gives the user a great deal of flexibility, this 
flexibility must be accompanied by a certain 
measure of responsibility. Is all this 
flexibility necessary? Or must the user pay for 
it in terms of the extreme care taken to program 
scheduling relations? And are there techniques he 
might employ in developing these relations that would 
allow the synchronization protocols to execute 
with greater efficiency? 


Another area for further study centers around 
the implementation of the queue. Our 
multiply-linked list is a "stretched-out" data 
structure, in the sense that it does not require a 
large set of malfunctioning processes to form a 
cut set and thereby cause the system to fail. Are 
there alternative data structures which require a 
larger cut set to separate and therefore present a 
lower probability of system failure? And exactly 
what would be the tradeoff between the improved 
reliability of these structures and the increased 
complexity and reduced efficiency of the code for 
the critical section problem? 


Mark R. 
related 


Brown provided 
to the failure 
made several 

nature and 
Figure 1: Version 1 Common Scheduler 

(Flowchart: a PARALLEL-AWAIT with four branches, 
taken when must precede changes, when process p 
begins a region statement, when process p leaves 
its critical section, and when process p returns 
from failure. The branches perform a deadlock 
test, put p on the tail of the queue, let any 
processes that can enter their critical sections 
do so, re-insert p into the queue, or let p 
continue with its non-critical section.) 


Figure 2: Version 2 Instruction for Process i: Enqueue Process j. 

(Flowchart: when process j begins execution of a 
region statement, determine the distance m from 
process i to the end of the queue; BEFORE := last 
s elements on the queue; the final box assigns 
AFTER_i[m].) 


Figure 3: Structure of the BEFORE and AFTER arrays. 

(Diagram: the processes from the head of the 
queue to the tail of the queue, each with its 
BEFORE and AFTER links.) 


Figure 4: Version 3 instruction for process i to 
determine if process j is caught in a deadlock. 

(Flowchart: a test of whether CYCLE = ON, followed 
by PARALLEL-AWAIT branches taken when some nearby 
process k fails (assume the operation of process 
k), when it is no longer possible for any process 
k to set CYCLE_k to ON (no deadlock), and when 
process j finishes its deadlock test; the 
remaining exit announces the presence of 
deadlock.) 
Abstract -- A new variant of the Generalized 
Petri Nets called the C-Colored Petri Model is 
presented. The elegance of this model for 
representing certain concurrent systems is exhibited 
by examples. Theoretical results concerning the 
representation power of this model are also 
presented. 


1. Introduction 


This paper presents a new variant of the 
Generalized Petri Nets ([6]) called the C-Colored 
Petri Model (C-CPM). The C-CPM is a model of 
systems exhibiting concurrency particularly well 
suited for the representation of: 


1. process synchronization structures 
involving dynamic priority hierarchies 
among processes; 

2. conflict resolution among competing 
processes; 

3. reentrant coordination structures. 


Section 2 formally defines the C-CPM. 
Sections 3 and 4 present a few modelling examples 
using the C-CPM. The selected modelling examples 
emphasize the naturalness and compactness of the 
representation offered by the C-CPM for 
coordination situations of the types mentioned above. 
Section 4 also presents some theoretical results 
concerning the representation power of the C-CPM. 
In particular, the modelling capabilities of the 
C-CPM are compared with the modelling 
capabilities of other variants of the Generalized Petri 
Nets such as the Priority Petri Model ([5], [11]), 
the Extended Petri Model ([1], [11]), the 
Coordination Petri Model ([8], [11]) and the 
Petri Net Model with switches, disjunctive 
logic and token absorbers ([2], [11]). Finally, 
Section 5 draws a few conclusions regarding the 
usefulness of the C-CPM. 


2. Formal Definition of the C-CPM 


Before defining the C-CPM we first give a 
few basic notations and definitions. 


Definition 2.1. A Generalized Petri Net 
(GPN) is a system $N = (T,P,I,O)$ where: 
1. $T$ is a finite set whose elements are 
called transitions; let $T = \{t_1,\ldots,t_n\}$. 
2. $P$ is a finite set whose elements are 
called places; let $P = \{p_{n+1},\ldots,p_{n+m}\}$, 
where $T \cap P = \emptyset$. 
3. $I: P \times T \to Z^0$ is a function called the input 
incidence function. $Z^0 = \{0,1,2,3,\ldots\}$ 
denotes the set of nonnegative integers. 
4. $O: T \times P \to Z^0$ is a function called the output 
incidence function. 


A GPN is represented graphically by a 
bipartite directed graph, where: 
1. transitions are represented by bars, 
2. places are represented by circles, and 
3. bars and circles are connected by directed 
arcs. The set of arcs is denoted by $A$. 

For every pair $(p_j,t_k) \in P \times T$ there are 
exactly $I(p_j,t_k)$ arcs directed from $p_j$ 
to $t_k$. If $I(p_j,t_k) > 0$ then $p_j$ is 
called an input place of the transition 
$t_k$ and the arcs directed from $p_j$ to $t_k$ 
are denoted by $a^{i}_{jk}$, $i = 1,\ldots,I(p_j,t_k)$. 
Each such arc $a^{i}_{jk}$ is called an input arc 
of the transition $t_k$ (from the input 
place $p_j$). Similarly, for every pair 
$(t_k,p_j) \in T \times P$ there are exactly 
$O(t_k,p_j)$ arcs directed from $t_k$ to $p_j$. If 
$O(t_k,p_j) > 0$ then $p_j$ is called an output 
place of the transition $t_k$ and the 
arcs directed from $t_k$ to $p_j$ are denoted 
by $a^{i}_{kj}$, $i = 1,\ldots,O(t_k,p_j)$. Each such arc 
$a^{i}_{kj}$ is called an output arc of the 
transition $t_k$ (to the output place $p_j$). 
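
Definition 2.1 translates directly into a small data structure. 
The class below is a minimal sketch, not part of the model's 
formal apparatus: the class name GPN, the dictionary encoding of 
the incidence functions, and the triple used to name an 
individual arc are all assumptions made for illustration. 

```python
from dataclasses import dataclass, field


@dataclass
class GPN:
    """A Generalized Petri Net N = (T, P, I, O)."""
    transitions: frozenset
    places: frozenset
    I: dict = field(default_factory=dict)  # (place, transition) -> count
    O: dict = field(default_factory=dict)  # (transition, place) -> count

    def __post_init__(self):
        if self.transitions & self.places:
            raise ValueError("T and P must be disjoint")

    def input_arcs(self, t):
        """Arcs (p, t, i) for i = 1..I(p, t): the input arcs of t."""
        return [(p, t, i)
                for (p, tk), count in sorted(self.I.items())
                if tk == t
                for i in range(1, count + 1)]

    def output_arcs(self, t):
        """Arcs (t, p, i) for i = 1..O(t, p): the output arcs of t."""
        return [(t, p, i)
                for (tk, p), count in sorted(self.O.items())
                if tk == t
                for i in range(1, count + 1)]

    def input_places(self, t):
        """Places p with I(p, t) > 0."""
        return {p for (p, tk), count in self.I.items()
                if tk == t and count > 0}


# One transition drawing two arcs from p1 and sending one arc to p2.
net = GPN(frozenset({"t1"}), frozenset({"p1", "p2"}),
          I={("p1", "t1"): 2}, O={("t1", "p2"): 1})
```

Pairs absent from the dictionaries stand for incidence value 0, so 
only the arcs that actually exist need to be listed. 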


In what follows, by a "set of colors", we 
shall mean a lattice $C = (X,\le)$, where $X$ is a finite 
set. An element of the set $X$ is called a color. 


Definition 2.2. Let $N = (T,P,I,O)$ be a GPN 
and let $C = (X,\le)$ be a set of colors associated 
with $N$. A color marking of $N$ is a function 
$CM: P \to \langle X \rangle^{*}$, where $\langle X \rangle^{*}$ denotes the bag closure 
of the set $X$, i.e., the set of all bags over 
the set $X$. 


For a detailed discussion concerning the 
concept of a bag the reader is referred to 
[3,4,10]. Pertinent notations and definitions 
employed in this paper are compatible with [11]. 

For each $p \in P$, $CM(p)$ is the color bag of the 
place p (in the color marking CM). For each 
element of the bag CM(p) there exists a token of 
that color in the place p for the color marking 
CM. A token is represented graphically by drawing 
a circle inside the corresponding place. The 
circle corresponding to each token is filled with 
the respective color (see Figure 2.1). 

Definition 2.3. A C-Colored Petri Net 
(C-CPN) is a system $CN = (N,C,F)$ where: 
1. $N = (T,P,I,O)$ is a GPN. 
2. $C = (X,\le)$ is a set of colors associated 
with $N$. 
3. $F: A \to X$ is a function. 

For each arc $a^{i}_{jk} \in A$, $F(a^{i}_{jk})$, also denoted by 
$c^{i}_{jk}$, is a color of the set $X$. If $a^{i}_{jk}$ is an input 
arc of some transition $t_k$ then $c^{i}_{jk}$ is called the 
color threshold of the input arc $a^{i}_{jk}$. On the 
other hand, if $a^{i}_{kj}$ is an output arc of some 
transition $t_k$ then $c^{i}_{kj}$ is called the output color 
of the output arc $a^{i}_{kj}$. Let us now explain the 
roles of the color thresholds and of the output 
colors. Suppose $CN$ is a C-CPN in some color 
marking $CM^r$. Let $t_k$ be a transition of $N$ and 
let $p_j$ be an input place of $t_k$. Let us consider 
an input arc $a^{i}_{jk}$, $1 \le i \le I(p_j,t_k)$. 

Definition 2.4. A color c of a token present 
in the place P, in the color marking CM isa 
candidate enabling color of the transition th on 


the input arc a in the color marking cM if 


i : 
oS cy and there exists no other token of color 
r 
c' in p, in the color marking CM such that 
Sie 
ay 
We note that there may exist tokens of colors 


eS :c" 


$c$ and $c'$ in the color bag $CM^r(p_j)$ such that $c$ and 
$c'$ are incomparable under the ordering relation 
$\le$ and each of the two colors is a candidate 
enabling color of $t_k$ on the input arc $a^{i}_{jk}$ in the 
color marking $CM^r$. An enabling color of the 
transition $t_k$ on the input arc $a^{i}_{jk}$ in the color 
marking $CM^r$ is a color arbitrarily selected from 
the set of candidate enabling colors for that 
transition $t_k$ on the input arc $a^{i}_{jk}$ in the color 
marking $CM^r$. The input arc $a^{i}_{jk}$ is said to claim 
the enabling token of color $c$ in the color 
marking $CM^r$. 
On the other hand, if there exists no color 


c es in the color bag cm” (p,) then there 


exists simply no candidate enabling color, and 
hence no enabling color, of tL on the input arc 

i, : r 
a.. in the color marking CM . 


fie the color 


Thus, 


cr actually sets a color threshold for the 

selection of the enabling color on the corres- 
ponding input arc a 
The bag of enabling colors of te from the 


input place P, in the color marking cM is: 


r 
Ok = 


<cilc is the enabling color of t on 


i 
the input arc ani in the color 


marking cM, AS aS Daas 


The bag of enabling colors of the transition t 


r. 
in the color marking CM is then: 


oP 
5 


We can now define the notion of an enabled 
transition. 


Definition 2.5. A transition $t_k$ is enabled 
in the color marking $CM^r$ if the following conditions 
hold: 
1. There is an enabling color in the color 
marking $CM^r$ for each input arc of $t_k$. 
2. $\theta^{r}_{jk} \subseteq CM^r(p_j)$ for each input place $p_j$ 
of $t_k$. 

In other words, the transition $t_k$ is enabled in 
the color marking $CM^r$ if each input arc of $t_k$ can 
claim in $CM^r$ a distinct enabling token. 

A transition enabled in some color marking 
may be selected to fire in that color marking. 
The rule under which this selection is made 
will be given later in this section. We shall 
first describe the effect produced by the firing 
of a transition. 


Definition 2.6. A transition $t_k$, enabled 
in some color marking $CM^r$, fires in the color 
marking $CM^r$ by performing the following 
operations: 

1. For each input arc $a^{i}_{jk}$, $i = 1,\ldots,I(p_j,t_k)$, 
for all input places $p_j$ of $t_k$, a token of color 
$c$ is removed from the corresponding input place 
$p_j$, where $c$ is the enabling color of $t_k$ on the 
arc $a^{i}_{jk}$ in the color marking $CM^r$. 

2. For each output arc $a^{i}_{kj}$, $i = 1,\ldots,O(t_k,p_j)$, 
for all output places $p_j$ of $t_k$, a token of color 
$F(a^{i}_{kj})$, also denoted by $c^{i}_{kj}$, is placed in the 
corresponding output place $p_j$. 


Let us now present the rule under which an 
enabled transition is firable. 
Let us define for each $p_j \in P$ the set: 

$$T(p_j) = \{\, t_k \mid t_k \in T \text{ and } I(p_j,t_k) > 0 \,\}$$ 

We shall denote by $E^r$ the set of transitions 
enabled in the color marking $CM^r$. 

Definition 2.7. There exists a conflict in 
the color marking $CM^r$ at the place $p_j$ for the 
color $c$ if and only if: 

$$\sum_{t_k \in E^r \cap T(p_j)} \#(c,\theta^{r}_{jk}) > \#(c,CM^r(p_j))$$ 

Here $\#(c,\theta^{r}_{jk})$ and $\#(c,CM^r(p_j))$ denote the number 
of occurrences of the color $c$ in the bags $\theta^{r}_{jk}$ 
and $CM^r(p_j)$, respectively. In other words, a 
conflict for the color $c$ has occurred in the 
color marking $CM^r$ at the place $p_j$ if the number 
of tokens of color $c$ claimed from $p_j$ by the 
transitions enabled in $CM^r$ is larger than the 
number of tokens of this color present in $p_j$ in 
the color marking $CM^r$. Note that there may 
exist conflicts at the same place for several 
distinct colors. 
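
The counting condition of Definition 2.7 can be checked with 
ordinary multisets. In the sketch below, bags are modelled with 
collections.Counter and the function name has_conflict is 
hypothetical; the caller is assumed to pass in the bags of 
enabling colors of exactly those enabled transitions that have 
the place as an input place. 

```python
from collections import Counter


def has_conflict(color, place_bag, claimed_bags):
    """Return True if there is a conflict for `color` at a place.

    place_bag    -- Counter of token colors present in the place
    claimed_bags -- one Counter per enabled transition drawing from
                    the place: the colors that transition claims there
    """
    claimed = sum(bag[color] for bag in claimed_bags)
    return claimed > place_bag[color]


# One red and two blue tokens; two transitions each claim a red token,
# and one of them also claims a blue token.
place = Counter({"red": 1, "blue": 2})
claims = [Counter({"red": 1}), Counter({"red": 1, "blue": 1})]
```

In this example there is a conflict for red (two claims, one 
token) but none for blue (one claim, two tokens), so conflicts at 
one place are indeed decided color by color. 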

Two transitions $t_k$ and $t_s$ are said to 
conflict in the color marking $CM^r$ (denoted by 
$t_k \circ t_s$) if the following conditions are true: 
1. $t_k \in E^r$ and $t_s \in E^r$. 
2. There exists at least one place $p_j$ such 
that $I(p_j,t_k) > 0$ and $I(p_j,t_s) > 0$ and 
there is a conflict at $p_j$ in the color 
marking $CM^r$ for some color $c$, where 
$c \in D(\theta^{r}_{jk}) \cap D(\theta^{r}_{js})$. $D(\theta^{r}_{jk})$ and $D(\theta^{r}_{js})$ 
denote the domains of the bags $\theta^{r}_{jk}$ and 
$\theta^{r}_{js}$, respectively. 
In addition we impose that $t_k \circ t_k$ for each 
$t_k \in E^r$. 


The conflict relation defined above can 
be extended as follows. Let CONF be the relation 
on $E^r$ such that for any $t_k \in E^r$ and $t_m \in E^r$, 
$(t_k,t_m) \in$ CONF if there exists a sequence of 
transitions $t_{q_1},\ldots,t_{q_n}$ such that 
$t_k = t_{q_1}$, $t_{q_n} = t_m$ and $t_{q_j} \circ t_{q_{j+1}}$ 
for $j = 1,\ldots,n-1$. 

We have the following obvious proposition: 

Proposition 2.1. CONF is an equivalence 
relation on $E^r$. 

An equivalence class of the 
relation CONF is called a conflict cluster. The 
quotient set of $E^r$ modulo CONF, denoted by $CF^r$, 
is called the set of conflict clusters of the 
color marking $CM^r$. 
Let $CF^r = \{CT^{r}_{1},\ldots,CT^{r}_{\alpha}\}$, where $CT^{r}_{i}$, 
$i = 1,\ldots,\alpha$, denote individual conflict clusters 
present in $CM^r$. 
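
Since CONF relates two enabled transitions whenever a chain of 
pairwise conflicts joins them, the conflict clusters are the 
connected components of the pairwise conflict relation. The 
sketch below computes them with a union-find structure; the 
function name conflict_clusters and the predicate in_conflict are 
assumptions made for illustration, and the pairwise relation is 
taken as given rather than derived from a marking. 

```python
def conflict_clusters(enabled, in_conflict):
    """Partition the enabled transitions into conflict clusters.

    `in_conflict(a, b)` is the pairwise conflict relation between two
    enabled transitions (assumed symmetric).
    """
    parent = {t: t for t in enabled}

    def find(t):
        # Follow parent links to the representative, halving the path.
        while parent[t] != t:
            parent[t] = parent[parent[t]]
            t = parent[t]
        return t

    for a in enabled:
        for b in enabled:
            if a != b and in_conflict(a, b):
                parent[find(a)] = find(b)  # merge the two clusters

    clusters = {}
    for t in enabled:
        clusters.setdefault(find(t), set()).add(t)
    return {frozenset(c) for c in clusters.values()}


# t1 conflicts with t2 and t2 with t3; t4 conflicts with nothing else.
pairs = {("t1", "t2"), ("t2", "t3")}


def sample_conflict(a, b):
    return (a, b) in pairs or (b, a) in pairs
```

In this example t1 and t3 land in the same cluster although they 
do not conflict directly, and t4 forms a cluster by itself. 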


Definition 2.8. Given a conflict cluster 


r ; ar ; 
CT. a conflict subcluster of CT; is a maximal 


Yr 
subset scT; of oo such that for any two 


k 


r r 
iti € scr, ; t o 2 
transitions ts CTs. and t,€ScTsy. * te 


In order to simplify notations, let us 
denote by SCF, the set of all conflict subclusters 


of a conflict cluster er. Note that SCF, is 


covering of oT; but not necessarily a partition. 


Definition 2.9. A transition $t_k$ is said 
to cover another transition $t_m$ in the color 
marking $CM^r$ (denoted by $t_k/t_m$) if the following 
conditions hold: 
1. $t_k \in E^r$ and $t_m \in E^r$. 
2. $\mathrm{g.l.b.}(D(\theta^{r}_{k})) > \mathrm{g.l.b.}(D(\theta^{r}_{m}))$. 

If $t_k/t_m$ in $CM^r$ we also say that $t_k$ has 
higher priority than $t_m$ in the color marking $CM^r$. 
Definition 2.10. The set of majorants of a conflict subcluster $SCT^r_{ik}$ is a maximal subset $\mathrm{maj}\{SCT^r_{ik}\}$ of $SCT^r_{ik}$ such that if $t_m \in \mathrm{maj}\{SCT^r_{ik}\}$ then there exists no other transition $t_s \in SCT^r_{ik}$ with $t_s / t_m$. Note that a conflict subcluster may have several distinct majorants.
Based on the above concepts, we define 
the conditions under which an enabled transition 


is firable. 


Definition 2.11. A transition $t_k$ is firable in some color marking $CM^r$ if and only if $t_k$ is a majorant of all the conflict subclusters of $CM^r$ in which it is contained.

Firing Rule: Any transition which is firable in a color marking $CM^r$ may be selected to be fired in $CM^r$.

The question can be raised whether the above 
rule can always enable one to find a firable 
transition from a nonempty set of enabled transi- 
tions. The answer to this question is positive 
as shown. 


Proposition 2.2. There exists a firable 
transition in any non-empty conflict cluster. 


Proof: Let $CT^r_i$ be a non-empty conflict cluster and let

$COL_i = \{\mathrm{g.l.b.}(D(\theta^r_k)) \mid t_k \in CT^r_i\}$.

Since $COL_i \subseteq X$ and $COL_i \neq \emptyset$, there must exist a maximal element of the set $COL_i$; let it be $c$. Suppose $t_k \in CT^r_i$ and $\mathrm{g.l.b.}(D(\theta^r_k)) = c$. Hence $t_k$ is a majorant of all conflict subclusters in which it is contained (recall that $CF^r$ is a partition of $E^r$) and therefore firable.


Note that $\mathrm{g.l.b.}(D(\theta^r_k)) = c$, where $c$ is a maximal element of the set $COL_i$, is a sufficient condition for $t_k$ to be firable in the color marking $CM^r$ but not a necessary condition.
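The chain of definitions above (conflict cluster, conflict subcluster, majorant, firable transition) can be illustrated with a small sketch. It is only an illustration under stated assumptions, not the authors' procedure: the enabled set, the pairwise conflict relation and the `covers` predicate are supplied by the caller, and the names `conflict_clusters`, `subclusters` and `firable` are hypothetical. Subclusters are found by brute-force enumeration, which is adequate only for small examples.

```python
from itertools import combinations


def conflict_clusters(enabled, conflicts):
    """Equivalence classes of CONF (transitive closure of the conflict relation)."""
    parent = {t: t for t in enabled}

    def find(t):
        while parent[t] != t:
            parent[t] = parent[parent[t]]
            t = parent[t]
        return t

    for a, b in conflicts:
        parent[find(a)] = find(b)
    clusters = {}
    for t in enabled:
        clusters.setdefault(find(t), set()).add(t)
    return list(clusters.values())


def subclusters(cluster, conflicts):
    """Maximal subsets of a cluster whose members conflict pairwise."""
    pairs = {frozenset(p) for p in conflicts}

    def pairwise(subset):
        return all(frozenset(p) in pairs for p in combinations(subset, 2))

    cliques = [set(s)
               for n in range(1, len(cluster) + 1)
               for s in combinations(sorted(cluster), n)
               if pairwise(s)]
    return [c for c in cliques if not any(c < d for d in cliques)]


def firable(enabled, conflicts, covers):
    """Transitions that are majorants of every subcluster containing them."""
    result = set()
    for cluster in conflict_clusters(enabled, conflicts):
        subs = subclusters(cluster, conflicts)
        for t in cluster:
            mine = [s for s in subs if t in s]
            if all(not covers(u, t) for s in mine for u in s if u != t):
                result.add(t)
    return result


# Assumed example: integer priorities stand in for g.l.b. values in the lattice.
priority = {"t1": 3, "t2": 2, "t3": 1, "t4": 1}
example = firable({"t1", "t2", "t3", "t4"},
                  [("t1", "t2"), ("t2", "t3")],
                  lambda a, b: priority[a] > priority[b])
```

In this assumed example `t3` is not firable even though the transition covering it (`t2`) is itself blocked by `t1`, which reflects the remark that the condition in the proof is sufficient but the definition is applied subcluster by subcluster.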

Let us now give a few definitions which we 
shall make use of in later sections. Suppose t 
is a transition firable in some color marking 


$CM^r$. Let us assume that $CM^{r+1}$ is a color marking obtained after the firing of $t$ in $CM^r$. This fact is denoted by $CM^r[t>CM^{r+1}$.
Definition 2.12. Given an arbitrary color marking $CM^r$, a firing sequence from $CM^r$ is a string $\gamma_s = t_{k_1} t_{k_2} \ldots t_{k_s}$ of transition names ($\gamma_s \in T^*$) such that there exist the color markings $CM^{r+i}$, $1 \le i \le s$, and $CM^{r+i-1}[t_{k_i}>CM^{r+i}$.

If $CM^{r+s}$ is a color marking obtained after the execution of the firing sequence $\gamma_s$ from $CM^r$ then we shall say that $\gamma_s$ can lead from $CM^r$ to the color marking $CM^{r+s}$ and we shall denote this fact by $CM^r[\gamma_s>CM^{r+s}$.


Definition 2.13. A C-Colored Petri Model (C-CPM) is a system $CP = (CN, CM^0)$ where:

1. $CN = (N,C,F)$ is a C-Colored Petri Net.
2. $CM^0$ is the initial color marking of CN.

Definition 2.14. Given a C-CPM $CP = (CN, CM^0)$, the set of firing sequences of CP is:

$S(CM^0) = \{\gamma \mid \gamma \in T^*$ and $CM^0[\gamma>CM$, for some color marking $CM\}$.

Given a final color marking $CM^f$, the set of terminal firing sequences of CP is:

$T(CM^0, CM^f) = \{\gamma \mid \gamma \in T^*$ and $CM^0[\gamma>CM^f\}$.
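The two sets of Definition 2.14 can be explored by bounded enumeration. The sketch below is a simplification and not the C-CPM itself: it uses a toy uncolored net (the table `NET` and the helpers `firable` and `fire` are assumptions made for illustration) so that the enumeration logic stays visible. The empty sequence is assumed to belong to the set of firing sequences.

```python
# Toy net (assumed): each transition maps to (tokens consumed, tokens produced)
# per place. Markings are tuples of token counts; colors are ignored here.
NET = {
    "t1": ((1, 0), (0, 1)),
    "t2": ((0, 1), (1, 0)),
}


def firable(marking):
    """Transitions whose input requirement is met by the marking."""
    return {t for t, (need, _) in NET.items()
            if all(m >= n for m, n in zip(marking, need))}


def fire(marking, t):
    """Marking reached by firing t."""
    need, give = NET[t]
    return tuple(m - n + g for m, n, g in zip(marking, need, give))


def reachable_sequences(initial, max_len):
    """Pairs (sequence, resulting marking) for all sequences up to max_len."""
    found = [((), initial)]
    frontier = list(found)
    for _ in range(max_len):
        frontier = [(seq + (t,), fire(m, t))
                    for seq, m in frontier
                    for t in sorted(firable(m))]
        found.extend(frontier)
    return found


def firing_sequences(initial, max_len):
    """Bounded approximation of the set of firing sequences."""
    return [seq for seq, _ in reachable_sequences(initial, max_len)]


def terminal_sequences(initial, final, max_len):
    """Bounded approximation of the set of terminal firing sequences."""
    return [seq for seq, m in reachable_sequences(initial, max_len) if m == final]
```

For a colored model the same loop would apply once `firable` and `fire` implement the firing rule of Section 2; only those two helpers depend on the net class.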


Figure 2.1 illustrates an example of a C-CPM and its firable transitions.


Figure 2.1
Sample C-CPM


3. Modelling Examples 


In this section we present two examples 
which illustrate the use of the C-CPM for 
modelling concurrent systems. We note in passing 
that it has been shown ([7]) that Generalized 
Petri Nets cannot correctly model a special case 
of our first example. 


3.1. A Producer-Consumer Synchronization Problem with Dynamic Priority Hierarchy


Consider the following process synchronization problem (Figure 3.1). The coordination system contains two buffers, $B_1$ and $B_2$. Two producer processes, $P_{i1}$ and $P_{i2}$, and a consumer process, $C_i$, are connected to each buffer $B_i$, $i = 1, 2$. Either one of the producer processes can be activated at any time. At that time it produces and deposits an item in the associated buffer. A consumer process can become activated only if the associated buffer is nonempty. The following rules govern the access of each consumer process to its respective buffer. Consumer process $C_i$ may consume from $B_i$ items deposited there by $P_{i2}$ only if $B_i$ does not contain any item produced by $P_{i1}$. Moreover, it is assumed that $C_1$ and $C_2$ can consume items from the corresponding buffers only through a channel with capacity 1, i.e., through a channel which can accommodate only one consumer process at a time. The following priority rule determines which consumer process can seize the channel in case both $C_1$ and $C_2$ are simultaneously active:

- if $C_1$ attempts to consume an item produced by $P_{12}$ while $C_2$ attempts to consume an item produced by $P_{21}$ then either consumer process can gain control over the channel;
- in all other cases $C_1$ has priority over $C_2$.
We note that the coordination system 
described above incorporates interrelated but 
distinct priority hierarchies: 

1. The consumer processes must select items 
from the respective buffers according to 
a fixed priority. 

2. Control over the channel is obtained in 
accordance with a varying priority 
hierarchy among the consumer processes. 

These coordination aspects can be modelled by 
the C-CPM in a natural way mainly due to the 
following specific features of this model: 

- the rule used for the selection of enabling colors;
- the dependence of the priority of a transition on the current color marking and on the particular selection of the enabling colors.

Figure 3.2 exhibits the C-CPM representation 

of this producer-consumer synchronization system. 


There, transitions tye toe .. and ty model the 


roducer processes P Pog 2 and P respec- 
Ee een gee: = pe = Oak Spt ee 


tively while the transitions t. and t. model the 


consumer processes Cc and Cyr respectively. The 
places Po and Ps represent the buffers BL and 


B respectively while Py can be viewed as the 


>! 
implementation of the shared channel. It can 
easily be verified that the firing sequences of 
the C-CPM of Figure 3.2 correctly model the 
coordination sequences which can be executed by 
the producer-consumer system under consideration. 
Using the C-CPM one can correctly model even
more sophisticated producer-consumer synchronization
problems of this type, involving several
producer-consumer processes and more complicated
priority hierarchies.


3.2. A Process Coordination Problem with Conflict 


Figure 3.1
Producer-Consumer Synchronization System with Dynamic Priority Hierarchy

Figure 3.2
C-CPM of the Producer-Consumer Synchronization System of Figure 3.1

Let us examine the following process coordination problem, depicted schematically in Figure 3.3. The system contains four buffers denoted by $B_1$, $B_2$, $B_3$ and $B$. Three processes, denoted by $P_{i1}$, $P_{i2}$ and Proc(i), are connected to each buffer $B_i$, $i = 1, 2, 3$. The buffer $B$ can be accessed only by the three processes Proc(i) and is initially assumed to contain two undistinguishable items. Any one of the processes $P_{ij}$ may be activated at any time. At that time it produces and deposits an item in the associated buffer $B_i$. The process Proc(i) becomes active only if the associated buffer $B_i$ and the shared buffer $B$ are nonempty. If active, Proc(i) attempts to consume an item from both $B_i$ and $B$ with the restriction that it may consume from $B_i$ an item produced by $P_{i2}$ only if $B_i$ does not contain any item produced by $P_{i1}$. Moreover, the following priority rule is imposed:

1. A conflict situation arises if the number
of items currently contained in the
buffer B is strictly less than the number
of active processes Proc(i). In this
case Proc(1) has highest priority to
proceed, Proc(3) has lowest priority and
Proc(2) has lower priority than Proc(1)
but higher than Proc(3).

2. Otherwise, any active process Proc(i)
may proceed.
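The priority rule can be restated as a small decision function. This is a sketch of the rule as worded above, not of the C-CPM that realizes it; the function name `may_proceed` and the encoding of processes as the integers 1 to 3 are assumptions made for illustration.

```python
def may_proceed(active, items_in_b):
    """Return the set of processes Proc(i) allowed to proceed next.

    active     -- set of indices i of currently active processes (subset of {1, 2, 3})
    items_in_b -- number of items currently held in the shared buffer B
    """
    if not active or items_in_b <= 0:
        # A process is active only when B is nonempty, so nothing can proceed.
        return set()
    if items_in_b < len(active):
        # Conflict: fewer items than competitors; the fixed order 1 > 2 > 3 decides.
        return {min(active)}
    # No conflict: any active process may be chosen (one at a time).
    return set(active)
```

With two items in B and all three processes active, only Proc(1) is eligible; with two items and two active processes, either may be selected.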

We note that in both situations only one process Proc(i) may proceed at a time. After the chosen process, say Proc(k), has consumed the respective items from $B_k$ and $B$ it will temporarily deactivate itself. For example, we can assume that Proc(k) initiates at this stage an abstract operation associated with it which requires an arbitrarily long (possibly null) period of time to execute. Only after the abstract operation is terminated will Proc(k) restore the item consumed from $B$ and regain its active status, as soon as both $B_k$ and $B$ are nonempty.

Meanwhile, however, the pending active processes Proc(i), $i \neq k$, are polled again in order to determine which process proceeds next. From this point of view the processes Proc(i) operate asynchronously and concurrently.

The characteristic aspect of this process 
coordination problem is the conflict situation 
which arises in case the number of available 


resources in a shared pool of homologous resources 
is less than the number of processes competing for 


the respective resources. The priority rule 
designed for solving the conflict situation can 
be represented in the C-CPM in a natural way 
mainly due to the conflict structure (conflict 
cluster, conflict subclusters, etc.), employed 
for the selection of a transition to be fired. 
Figure 3.4 exhibits the C-CPM representation of 
the coordination system of Figure 3.3. Adjacent 
to the places and transitions of Figure 3.4, we 
have indicated the entities of the coordination 
system they model. Let us examine more closely 
the models of the processes Proc(i). 
transition to is considered to model the initia- 


tion of the abstract operation associated with 
Proc(l) while ty models the termination of that 


operation. The presence of a token in place 
Pio signifies that the abstract operation is in 
execution. Conversely, the occurrence of a token 


in place p Signals that the respective opera- 


18 
tion is terminated. The processes Proc(2) and 
Proc(3) are modelled similarly. 

This type of coordination problem can certainly be extended further. In particular, we note that we have used a totally ordered priority hierarchy for solving the conflict situation at the buffer $B$ only in order to keep the problem simple. Actually, more sophisticated priority hierarchies can be used. For example, we can make the priority of the processes Proc(i) depend on the particular selection of items from the associated buffers $B_i$. In that case, the conflict situation could be solved according to a dynamic priority hierarchy similar to that described in Section 3.1.


4. The Representation Capabilities of the C-CPM 


In the previous section we have presented 
two modelling examples in order to stress the 
usefulness of the C-CPM with respect to the 
representation of certain process synchronization 
structures. In this section we give some results 
of an in-depth analysis concerning the compara- 
tive representation capabilities of the C-CPM 
and other variants of the Generalized Petri Nets. 
The family of formal languages generated by a 
class of Petri Nets is assumed here to be an 
indication of the representation capability of 
that class of Petri Nets. For the analysis 
itself, the reader is referred to [ll]. Let us 
now define the families of languages in question, 


Definition 4.1. A Labelled C-Colored Petri Model is a system $CP_L = (CP, \Sigma, L)$ where:

1. $CP = (CN, CM^0)$ is a C-CPM.
2. $\Sigma$ is a finite alphabet.
3. $L: T \to \Sigma$ is a function called the labelling function of $CP_L$.


Figure 3.3
Process Coordination System with Conflict


Figure 3.4
C-CPM Representation of the Coordination System of Figure 3.3


The definition of the labelling function can easily be extended to firing sequences. Let $t\gamma$ be a firing sequence of CP. Then $L(t\gamma)$ is recursively defined as follows:

$$L(t\gamma) = \begin{cases} L(t) & \text{if } \gamma = \lambda \ (\lambda \text{ denotes the empty string}) \\ L(t)L(\gamma) & \text{if } \gamma \neq \lambda \end{cases}$$

where $L(t)L(\gamma)$ represents the concatenation of the strings $L(t)$ and $L(\gamma)$. By convention, $L(\gamma) = \lambda$ if and only if $\gamma = \lambda$, the empty firing sequence.
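The recursive extension of the labelling function can be written out directly. The sketch below assumes that labels are single-character strings and that a firing sequence is a list of transition names; `label_sequence` and the sample label table are hypothetical names used only for illustration.

```python
def label_sequence(labels, sequence):
    """Extend a labelling function on transitions to a firing sequence.

    labels   -- dict mapping each transition name to one symbol of the alphabet
    sequence -- list of transition names (a firing sequence)
    """
    if not sequence:
        # By convention the empty firing sequence maps to the empty string.
        return ""
    head, *tail = sequence
    # L(t gamma) = L(t) L(gamma): concatenate the head label with the rest.
    return labels[head] + label_sequence(labels, tail)


# Two transitions may share a label, so the mapping is a renaming, not a bijection.
sample_labels = {"t1": "a", "t2": "b", "t3": "a"}
word = label_sequence(sample_labels, ["t1", "t2", "t3"])
```

Because every transition maps to exactly one symbol, the image has the same length as the firing sequence, which is what makes the homomorphism nonerasing.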


Definition 4.2. $\mathcal{S}$ is the family of languages such that $S \in \mathcal{S}$ if and only if there exists a C-CPM $CP = (CN, CM^0)$ where $CN = (N,C,F)$, $N = (T,P,I,O)$ and $S = S(CM^0)$. Obviously, $S \subseteq T^*$.

$\mathcal{S}_0$ is the family of languages such that $S \in \mathcal{S}_0$ if and only if there exists a C-CPM $CP = (CN, CM^0)$ and a final color marking $CM^f$ of CN, $CM^f \neq CM^0$, such that $S = T(CM^0, CM^f)$. Obviously, $S \subseteq T^*$.

Notice that the labelling function L as specified in Definition 4.1 is a nonerasing renaming homomorphism from $T^*$ into $\Sigma^*$. For any family of languages A, we shall define the image of A under nonerasing renaming homomorphism to be the set:

$T(A) = \{L(a) \mid a \in A$ and L is a nonerasing renaming homomorphism on A$\}$.


The families of languages of interest to our study 
are then defined as follows: 


Definition 4.3. The family of computation sequence sets of the C-CPM is the family of languages $A$, where $A = T(\mathcal{S})$. Similarly, the family of terminal computation sequence sets of the C-CPM is the family of languages $A_0$, where $A_0 = T(\mathcal{S}_0)$.


Language families defined similarly were
studied in [5] and [9] with respect to the 
Generalized Petri Nets, the Inhibitor Nets (called 
Extended Petri Model in [11]) and the Priority 
Nets (called Priority Petri Model in [11]). Thus, 
a common basis of comparison between these models 
and the C-CPM is provided. 

In what follows, we shall give a few results regarding the language families $A_0$ and $A$ defined above. Due to the complexity of the proofs involved and to the lack of space, the results will be given without proof. Detailed proofs can be found in [11].

As already mentioned in the introductory section, we have compared the modelling power of the C-CPM with the modelling power of the Extended Petri Model (EPM), the Priority Petri Model (PPM), the Coordination Petri Model (COPM) and the Petri Net Model with switches, disjunctive logics and token absorbers (PM). Defining $A_0(EPM)$, $A_0(PPM)$, $A_0(COPM)$, $A_0(PM)$ and $A(EPM)$, $A(PPM)$, $A(COPM)$, $A(PM)$ analogous to $A_0$ and $A$ respectively, we have shown that:

$A_0 = A_0(EPM) = A_0(PPM) = A_0(COPM) = A_0(PM) = \{w - \{\lambda\} \mid w \in C\} = \{w - \{\lambda\} \mid w \in CE\}$.

Here C and CE denote the families of languages accepted in quasi-real-time by counter acceptors and E-counter acceptors (for exact definitions of the families C and CE see [11]).

Also, $A = A(EPM) = A(PPM) = A(COPM)$ and $A \subseteq A(PM)$. It is not known whether $A(PM) \subseteq A$ as well and thus whether $A = A(PM)$. However, $\{w - \{\lambda\} \mid w \in A(PM)\} \subseteq A_0$.

For any family of languages A, the image of A under renaming homomorphism (not necessarily nonerasing) is the set:

$H(A) = \{h(a) \mid a \in A$ and h is a renaming homomorphism on A$\}$.

Based on the equality results given above, all the results obtained in [5] with respect to the language families $A(EPM)$, $A(PPM)$, $A_0(EPM)$ and $A_0(PPM)$ can immediately be extended to the language families $A$ and $A_0$ respectively.

In particular, from theorem 9.3 of [5] it follows that $H(A_0)$ is the family of recursively enumerable languages.

Between the language families $A$ and $A_0$ we have shown that there exists the following relationship:

$\{w - \{\lambda\} \mid w \in A\} \subseteq A_0$.


The constructs used in order to prove the above results have allowed the following extensions in the definition of the C-CPM. We have shown that the language families $A_0$ and $A$ are not affected if the following color thresholds and output colors are used:


i i ef 
- if i input arc th F(a =f 
if a,. is an inp arc en F( 5k 4k 


where f. :P(X%)>xX is a function (P(X) 


denotes the power set of X) and in any 


color marking cM’, the color threshold 
used in the selection of the enabling 


i og 
color for the input arc ae is 


an (D(cM (p.))); 
5k - (P. 


i 
- if 
if a. 
where fy iP (x) > is a function and in any 
color marking cm", the color of the output 
a. in the output place 
; ; rx i r 
i f M ig : 
p, in case t, fires in cM , is xg (P 68,,)) 


In fact, the language families AS and A are not 


i 
is an output arc then Flay) — a 


token placed by a 


affected even if the colors of the output tokens 
of a transition are established according to the 
following more general rule: 


i i 
- if is an output arc then F(a = f 
i ass Ss an outp ( eee 


where EV iP (X) x P(x) > x is a function and 


, r 
in any color marking CM the color of the 
token produced by th on the output arc 


1 . r i r 
j t fir j M is f 
a in case : es inc is xg (2 §9,) # 


D(cm™ (p.))) where 


Yr A 2 
CM (po) if I(p_,t,) 0 


cm” (p_) = 


cM (p.) 210" 


i > 
sk if T(poty) O. 


These extensions of the definition of the C-CPM 
are useful when one is interested in modelling 
the flow of control in reentrant subroutines, 
for example. The efficient and clear modelling 
of reentrancy becomes especially important when 
modelling real-life collections of programs such 
as compilers or reentrant portions of operating 
systems. 


Figure 4.2 exhibits our model of the flow of
control in the reentrant subroutine of Figure 4.1.
Distinct control flow streams operating in a
reentrant subroutine are represented by distinct
colors. We note that upon firing, transition 


t laces in a token of color c_ orc 
5 P Po 2 ey 


depending on its enabling color. On the other 
hand, due to the particular structure of the 


lattice of colors, the transitions t and ty 


can select only the colors Cy and C3, respec~ 


tively, as enabling colors from p Thus, control 


9° 
is always returned from the reentrant subroutine 
to the proper calling routine. 
We note that the purpose of the output color 
ter ‘ 
function fe is to "paint" the output tokens pro- 


duced by the transition ty with the same color 


as that of the tokens absorbed by the respective 
transition from its input place. In this way the 
continuity of the control flow streams operating 
concurrently in the reentrant subroutine is 
ensured. The usage of this type of mappings as output color functions enhances the naturalness of the representation of reentrancy. This method applies to nested reentrant subroutines as well. Consider, for example, the synchronization situation depicted schematically in Figure 4.3. There, both subroutines A and B may be invoked concurrently by different calling programs. Figure 4.4 displays the C-CPM representation of the flow of control in this synchronization system. Note that due to the structure of the lattice of colors, the return of control from subroutine B is correctly modelled, that is, the control flow stream originating from PROGRAM 3 can never be routed to subroutine A upon termination of the execution of subroutine B and vice versa.


Figure 4.1
Sample Reentrant Subroutine


Figure 4.2
C-CPM Representation of a Sample Reentrant Subroutine


Figure 4.3
Nested Reentrant Subroutines


Figure 4.4
Modelling of Nested Reentrant Subroutines


5. Conclusions 


In Section 4 we have seen that the PPM, EPM, COPM, PM and the C-CPM are completely equivalent from the point of view of the $A_0$ language family generated. This fact is rather surprising if one considers that these models were introduced both independently and for distinct purposes. Each class of Petri Nets listed above was defined in order to offer a natural representation of a certain class of synchronization systems of interest to the respective authors. Therefore, the equivalence results given in Section 4 do not diminish the usefulness of any of these classes of Petri Nets.

The modelling examples presented in this paper point out the kind of process synchronization structures for which the C-CPM offers a simple and clear representation which preserves the natural process structure. It is worthwhile mentioning that in [11] it was argued in detail that the EPM, PPM, COPM and PM fail to offer a natural and easy to understand representation for the process synchronization structures considered here.
Abstract - A technique is presented for instrumenting
Concurrent Pascal Systems at the component level. The key 
technical problem is that of implanting in each component a 
unique index which can be matched to the name of the com- 
ponent by the instrumentation package. The implantation 
technique requires two monitors for parent-parent synchroni- 
zation and parent-offspring synchronization. An additional 
monitor provides percent utilization statistics by component 
and a report process periodically produces summary figures. 


The entire procedure can be automated. The source 
Concurrent Pascal System can be read by an instrumentation 
program and the Instrumented Source Program produced as 
output. No modifications are required to the Concurrent 
Pascal Compiler, hence the same compiler can be used to 
compile and run either the instrumented or uninstrumented 
version of the system. 


Introduction 


Program instrumentation has been studied in a number 
of contexts in the several years since its introduction into the 
literature [1], [3], [4], [5], [6], [7], [8], [9]. Here we consider 
the problem of instrumenting programs which are written in 
Concurrent Pascal [2] for the purpose of determining compo- 
nent utilization. The placement of instrumentation probes in 
this environment is somewhat specialized since system com- 
ponents may be multiple realizations of system types. Hence 
the probes, which appear as syntax within the system types, 
must produce component specific statistics. 


It is necessary to imbed in each component a unique 
index which can be utilized to identify the component during 
the collection and reporting of statistics. This is accom- 
plished by a flip-flop monitor which synchronizes parent- 
offspring calls to a "census" table. Upon spawning an off- 
spring, a parent enters its name in the table. The offspring 
then retrieves the table index of the name. The monitor is 
written to ensure that these events occur in proper order.
Section two describes this process in detail. 


There is also the problem of simultaneous spawning of 
offspring by several parent components. It may be necessary 
to queue a number of parents to wait for access to the census 
table. It is expedient to handle this problem by a second 
monitor which acts as a virtual facility which must be seized 
by each parent in turn. Section three describes the details of 
parent-parent synchronization.


Section four contains a description of the syntax of each 
probe along with a set of rules for placing the probes. It 
follows from the rules that the placement of probes can be 
automated. Hence it is possible to write an instrumenting 
system which will automatically instrument an arbitrary Con- 
current Pascal system. 
Section five presents a sample producer-consumer system.
The uninstrumented system is given first and is followed
by the instrumented version.


The details of the program for automatic instrumentation 
of Concurrent Pascal systems are not covered here. They are 
intended to be the subject of a future report. 


Parent-Offspring Synchronization 


The IMPLANT monitor is given below. It is realized by a system component named IMPLANTER. The parent component alternately performs IMPLANTER.NEWNAME('k_i') and init k_i end. The entry procedure NEWNAME enters 'k_i' in the census table and updates the table index. In the parent component, the statement init k_1,...,k_n end is replaced by alternate calls to NEWNAME and individual init statements as given by probe SI in section four.


The first executable statement of each system type (which must of necessity be realized as one or more offspring) is, upon instrumentation, changed to the second executable statement. The new first statement is

COMPONENTNO:=IMPLANTER.INDEXNO

In this way, the table index is implanted in the component. This index value can then be used as a parameter of any calls to attribute gathering components. For instance, if we wish to monitor component utilization, we can have a CLOCK monitor, realized by CLOCKER. We can then execute

CLOCKER.START (COMPONENTNO) 
or 

CLOCKER.STOP (COMPONENTNO) 


The CLOCK monitor syntax is not given here since the 
coding for this monitor is obvious. There is also the need for 
a separate process type TABULATE (realized by TABULA- 
TOR) which will periodically output the statistics gathered. 


The following constants and types are global to the
Instrumentation System Types. They can be added to the
beginning of the program being instrumented.

const NOCOMP = maximum number of components allowed;
      IDLENGTH = maximum number of characters in an identifier;
type STRING = packed array [1..IDLENGTH] of char;
(The key words packed array can be replaced by array if necessary)

The IMPLANT monitor program is the following: 


type IMPLANT = monitor ;

var INDEX : [1..NOCOMP] ; FATHER : boolean ;
    COMPTAB : array [1..NOCOMP] of STRING ;
    FQUE, CQUE : queue ;

procedure entry NEWNAME (NAME : STRING) ;
begin
  if not FATHER then delay (FQUE) ;
  FATHER := false ;
  COMPTAB[INDEX] := NAME ;
  continue (CQUE)
end ;

function entry INDEXNO : [1..NOCOMP] ;
begin
  if FATHER then delay (CQUE) ;
  FATHER := true ;
  INDEXNO := INDEX ;
  INDEX := INDEX + 1 ;
  continue (FQUE)
end ;

begin
  INDEX := 1 ;
  FATHER := true
end ;
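The alternation enforced by the IMPLANT monitor (parent registers a name, offspring fetches the index, and so on) can be mimicked outside Concurrent Pascal. The following is only an analogue under assumptions: it uses Python's `threading.Condition` in place of the monitor's two queues, indices start at 0 rather than 1, and the class and method names are hypothetical.

```python
import threading


class Implanter:
    """Turn-taking census table: a parent call and an offspring call alternate."""

    def __init__(self):
        self._cond = threading.Condition()
        self._father_turn = True   # plays the role of FATHER
        self._index = 0            # plays the role of INDEX (0-based here)
        self.table = []            # plays the role of COMPTAB

    def new_name(self, name):
        """Called by the parent before it starts one offspring."""
        with self._cond:
            self._cond.wait_for(lambda: self._father_turn)
            self._father_turn = False
            self.table.append(name)
            self._cond.notify_all()

    def index_no(self):
        """Called by the offspring as its first action; returns its table index."""
        with self._cond:
            self._cond.wait_for(lambda: not self._father_turn)
            self._father_turn = True
            index = self._index
            self._index += 1
            self._cond.notify_all()
            return index


def spawn_all(implanter, names):
    """Register each name, start an offspring for it, and collect the indices."""
    got = {}
    threads = []
    for name in names:
        implanter.new_name(name)
        th = threading.Thread(
            target=lambda n=name: got.__setitem__(n, implanter.index_no()))
        th.start()
        threads.append(th)
    for th in threads:
        th.join()
    return got
```

Because `new_name` blocks until the previous offspring has fetched its index, each offspring receives the index of the name registered immediately before it was started.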


Parent-Parent Synchronization 


The SPAWN monitor, realized by SPAWNER, provides 
for an array of queues for parent components which are 
waiting for access to the census table. In Concurrent Pascal, 
the queue data type is allowed to delay and thus hold all 
systems information for one component only. Hence the 
programmer must provide for an array of queues if more than 
one component can be delayed at a time. 


Before any parent can execute a statement of the form
init k_1,...,k_n end it must first execute
SPAWNER.SEIZE
After the init statement, it must execute
SPAWNER.RELEASE
These two statements are part of probe SI mentioned before
and described in the next section.


The queue array is programmed as a queue data struc- 
ture in the usual way: 


type SPAWN = monitor ;

var LINE : array [0..NOCOMP] of queue ; BUSY : boolean ;
    HEAD, TAIL : integer ;

procedure entry SEIZE ;
begin
  if BUSY then
  begin
    TAIL := TAIL + 1 ;
    delay (LINE [TAIL mod NOCOMP])
  end
  else BUSY := true
end ;

procedure entry RELEASE ;
begin
  if HEAD < TAIL then
  begin
    HEAD := HEAD + 1 ;
    continue (LINE [HEAD mod NOCOMP])
  end
  else BUSY := false
end ;

begin
  HEAD := -1 ; TAIL := -1 ; BUSY := false
end ;
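The SPAWN monitor behaves as a first-come, first-served facility: a releasing parent hands the facility directly to the longest-waiting parent. A rough Python analogue, given here only as a sketch, replaces the circular array of queues with a `deque` of `threading.Event` objects; the class name `Spawner` and its attribute names are assumptions.

```python
import threading
from collections import deque


class Spawner:
    """FIFO facility that at most one parent holds at a time."""

    def __init__(self):
        self._lock = threading.Lock()
        self._busy = False        # plays the role of BUSY
        self._line = deque()      # plays the role of LINE, HEAD and TAIL

    def seize(self):
        """Acquire the facility, waiting in arrival order if it is busy."""
        with self._lock:
            if not self._busy:
                self._busy = True
                return
            gate = threading.Event()
            self._line.append(gate)
        gate.wait()               # resumed by release(); the facility stays busy

    def release(self):
        """Hand the facility to the next waiting parent, or mark it free."""
        with self._lock:
            if self._line:
                self._line.popleft().set()
            else:
                self._busy = False

    def waiting(self):
        """Number of parents currently queued."""
        with self._lock:
            return len(self._line)
```

As in the monitor, the busy flag is left set when a waiting parent is resumed, so no newcomer can overtake the parent at the head of the line.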
Instrumentation Probes


Altogether there are eight different types of probes.
The following list gives each probe a name (e.g. A, C, G, etc.),
describes what the probe does and gives the syntax of the
probe. In the syntactical description, square brackets indicate
syntax which may be required by the context and braces indicate
a choice of syntax. The variation in syntactical detail is
quite minor and probe placement can be automated.


A   : Definition of access rights to instrumentation monitors
C   : Definition of component number variable (COMPONENTNO)
G   : Start clock
I   : Implant component number value
ICD : Instrumentation Components definition
ICI : Instrumentation Components initialization
S   : Stop clock
SI  : "Sequentialization" of init statement

A   : { ( or , } [SPAWNER:SPAWN;] IMPLANTER:IMPLANT; CLOCKER:CLOCK { ) }

C   : [var] COMPONENTNO:[1..NOCOMP];

G   : CLOCKER.START(COMPONENTNO)[end][;]

I   : COMPONENTNO:=IMPLANTER.INDEXNO;

ICD : SPAWNER:SPAWN; IMPLANTER:IMPLANT; CLOCKER:CLOCK; TABULATOR:TABULATE;

ICI : init SPAWNER, IMPLANTER, CLOCKER, TABULATOR end ;

S   : [begin] CLOCKER.STOP (COMPONENTNO) [;]

SI  : replace init k_1,...,k_n end by
      [begin] SPAWNER.SEIZE; IMPLANTER.NEWNAME('k_1'); init k_1 end; ...; IMPLANTER.NEWNAME('k_n'); init k_n end; SPAWNER.RELEASE [end][;]


The rules for probe placement can be summarized as follows: 


A   : Placed in all system type definition statements. If the type has no access rights, the parentheses are used; otherwise the comma and blank are used to append access to the instrumentation monitors. If the system type contains an init statement, access to the SPAWNER must be included.

C   : Placed in the variable definition block of each system type. If the block is void, var must be included.

G   : Placed as the new first executable statement of each entry and (with I preceding it) at the beginning of each initialization routine. Also placed after each delay. If delay is part of a structured statement, end and/or a semicolon may be required.

I   : Placed as the first executable statement of each initialization routine.

ICD : Placed in the variable definition section of the initial process.

ICI : Placed as the first executable statement of the initial process.

S   : Placed before each delay. If delay is part of a structured statement, begin may be required and a semicolon is required. Also placed before each continue. If continue is part of a structured statement, begin may be required and a semicolon is required. Also placed before each final end of each entry and initialization routine which is not immediately preceded by cycle ... end ; or continue (...);

SI  : Replaces all init statements.
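Since the rules are mechanical, a single probe can be placed by simple text rewriting. The sketch below illustrates probe SI only and is not the authors' instrumentation program, whose details the paper defers to a future report. It assumes the init statement is written as `init ... end` in the paper's style, omits the optional `[begin]`/`[end]` wrapping, and uses hypothetical helper names.

```python
import re

# Matches "init <component list> end" as written in the paper's listings.
INIT_RE = re.compile(r"\binit\s+(.+?)\s+end\b", re.IGNORECASE | re.DOTALL)


def split_components(text):
    """Split a component list on commas that are not inside parentheses."""
    items, depth, current = [], 0, []
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        if ch == "," and depth == 0:
            items.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    items.append("".join(current).strip())
    return [item for item in items if item]


def sequentialize_init(source):
    """Apply probe SI: one NEWNAME call and one init per component."""
    def expand(match):
        parts = ["SPAWNER.SEIZE;"]
        for item in split_components(match.group(1)):
            name = item.split("(")[0].strip()
            parts.append(f"IMPLANTER.NEWNAME('{name}'); init {item} end;")
        parts.append("SPAWNER.RELEASE")
        return " ".join(parts)

    return INIT_RE.sub(expand, source)
```

A complete tool would also have to skip the init statement introduced by probe ICI and decide when the surrounding `begin`/`end` pair is needed; both are left out of this sketch.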


Example 


In the following Producer-Consumer system, the Producer
is producing (pseudo) random integers in the range 0 to
1740. The Consumer is keeping a running tally of the largest
and smallest numbers produced. The bin between the Producer
and the Consumer holds ten numbers and is treated as a
stack. The uninstrumented system is given first and the
placement of probes is indicated in the second program listing.


Uninstrumented Producer — Consumer Example 


type PRODUCE = process (B:BIN) ;
var V : integer ;
procedure MAKE (var V: integer) ;
begin
  V := (V*1979) mod 1741
end ;
begin
  V := 761 ;
  cycle MAKE (V) ; B.SEND (V) end ;
end ;


type CONSUME = process (B:BIN) ;
var V, HI, LO : integer ;
procedure USE (V:integer) ;
begin
  if V > HI then HI := V ;
  if V < LO then LO := V ;
end ;
begin
  HI := -1 ; LO := 1741 ; cycle B.RECEIVE (V) ;
  USE (V) end ;
end ;


type BIN = monitor ;
var EMPTY, FULL : boolean ; I : 0..10 ; PQ, CQ : queue ;
    STACK : array [1..10] of integer ;
procedure entry SEND (V:integer) ;
begin
  if FULL then delay (PQ) ;
  I := I+1 ; EMPTY := false ;
  STACK [I] := V ;
  if I = 10 then FULL := true ;
  continue (CQ)
end ;
procedure entry RECEIVE (var V:integer) ;
begin
  if EMPTY then delay (CQ) ;
  V := STACK [I] ;
  I := I-1 ; FULL := false ;
  if I = 0 then EMPTY := true ;
  continue (PQ)
end ;
begin
  I := 0 ; FULL := false ; EMPTY := true
end ;
var PRODUCER : PRODUCE ; CONSUMER : CONSUME ;
    BUF : BIN ;
begin
  init BUF, PRODUCER (BUF), CONSUMER (BUF) end ;
end.
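The arithmetic performed by MAKE and USE in the listing above can be checked in isolation. The sketch below reproduces only the number generation and the high/low tally, leaving out the monitor and the concurrency; the function names `make` and `run` are illustrative.

```python
def make(v):
    """One step of the producer's generator: V := (V*1979) mod 1741."""
    return (v * 1979) % 1741


def run(seed=761, count=10):
    """Produce `count` values from `seed` and tally the extremes as USE does."""
    v, hi, lo = seed, -1, 1741      # same initial values as the listing
    produced = []
    for _ in range(count):
        v = make(v)
        produced.append(v)
        if v > hi:
            hi = v
        if v < lo:
            lo = v
    return produced, hi, lo


first_three, hi, lo = run(count=3)  # first_three == [54, 665, 1580]
```

Every value lies in the range 0 to 1740 because of the modulus, which is why the Consumer can safely start its tally from -1 and 1741.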


Probes indicated by boldface type 


type PRODUCE = process (B:BIN A) ; 
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var V : integer; € 
procedure MAKE (V:integer) ; 
begin 
V : = (V*1979) mod 1741 
end ; 
begin 1G 
V;=761; 
cycle MAKE (V) ; § B.SEND (V) G end ; 
end; 
type CONSUME = process (B:BIN A ) ; 
var V, HI, LO: integer; € 
procedure USE (V:integer) ; 
begin 
if V > HI then HI: =V; 
ifV < LO thnLO:=V; 


end ; 

begin 1 G 
HI: = -1; LO; = 1741 ; eyvcle § B,RECEIVE (V) G ; 
USE (V) end ; 

end ; 


type BIN = monitor A;
var EMPTY, FULL: boolean; I: 0..10; PQ, CQ: queue;
    STACK: array [1..10] of integer; E
procedure entry SEND (V: integer);
begin G
  if FULL then S delay (PQ) G;
  I := I+1; EMPTY := false;
  STACK [I] := V;
  if I = 10 then FULL := true;
  S continue (CQ)
end;
procedure entry RECEIVE (var V: integer);
begin G
  if EMPTY then S delay (CQ) G;
  V := STACK [I];
  I := I-1; FULL := false;
  if I = 0 then EMPTY := true;
  S continue (PQ)
end;
begin I G
  I := 0; FULL := false; EMPTY := true S
end;
var PRODUCER: PRODUCE; CONSUMER: CONSUME;
    BUF: BIN; ICD
begin ICI
  init BUF, PRODUCER (BUF), CONSUMER (BUF)
end;
end.


A DEMON LANGUAGE COMPILER ON A NETWORK FOR PARALLEL CONTROL

Summary


We have designed and implemented a demon 
language for our experiments in heuristic control 
of loosely coupled complex systems [1]. The 
compiler itself is written in Fortran on a PRIME
400, but it generates code for TECHNEC [2], a ring
network of LSI-11s running in parallel. A critical
problem is the control of galactic variables, 
variables whose values must be available 
simultaneously on several nodes. 


A program for controlling a complex system may 
be relieved of the need to regulate everything at 
once, through the presence of independent 
procedures that monitor system behavior and take 
corrective action when they detect some particular 
pattern of events. In the artificial intelligence 
field such monitors are called demons. In programs 
that are to be distributed over a network computer, 
demons are independent enough to run in parallel, 
and they may be programmed separately. 


A demon has five features: a trigger event, 
an evocation condition, a procedure body, inter- 
faces for output to the outside world, and 
scheduling policies. When the trigger event 
occurs, the evocation condition is evaluated. If 
the condition evaluates "true", the procedure body 
is executed. The procedure may use the output 
interfaces to control actuators to influence the 
physical environment or may cause other demons to
be evoked. The scheduling policies control such 
things as whether the demon is active ("paying 
attention" or not), and how soon the procedure must 
be executed after the event and condition cause it 
to be evoked. 


Trigger events can be the completion of a time 
interval, the arrival of a message, or the 
modification of a monitored variable. So that 
demons need not keep scanning variables, a list is 
kept for each variable of all demons monitoring it. 
The system is designed to accommodate multiple main
programs, each with its set of "owned" demons and 
procedures. The demons owned by a main program may 
inhabit different nodes than their owner. 
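
The demon mechanism described above can be illustrated with a small Python sketch. It is not the demon language itself: the names Demon, Supervisor, register, and assign are assumptions made for illustration, only the "modification of a monitored variable" trigger is modeled, and the scheduling policy is reduced to a single active flag.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Demon:
    name: str
    trigger: str                        # monitored variable (trigger event)
    condition: Callable[[dict], bool]   # evocation condition
    body: Callable[[dict], None]        # procedure body
    active: bool = True                 # scheduling policy: "paying attention"


class Supervisor:
    """Keeps, for each variable, the list of demons monitoring it."""

    def __init__(self):
        self.variables: Dict[str, float] = {}
        self.watchers: Dict[str, List[Demon]] = {}

    def register(self, demon: Demon) -> None:
        self.watchers.setdefault(demon.trigger, []).append(demon)

    def assign(self, name: str, value: float) -> None:
        # Modifying a variable is the trigger event; only the demons on
        # that variable's list are examined, so no demon has to scan.
        self.variables[name] = value
        for demon in self.watchers.get(name, []):
            if demon.active and demon.condition(self.variables):
                demon.body(self.variables)
```

A demon body may itself call assign, which is one way a demon can cause other demons to be evoked.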


The compiler is organized as a single pass
with some forward references left to be resolved by
the LSI-11 assembler. It is written in the PRIME's
recursive extended Fortran, which makes it simple
to use recursive descent parsing. The compiler
generates indirect threaded code [3] partitioned
into separate modules for the LSI-11s.


Design of the runtime environment is strongly
affected by the fact that the LSI-11s have no
common memory, so all communication of values of
variables between them must take place via
messages. The demon language has several kinds of
variables. Variables declared within a procedure
or a demon are local to the subprogram in which
they are declared and do not need special handling.
Variables declared within a main program are global
to all demons and procedures owned by that main
program. These variables may need to be
transmitted to other nodes if they are parameters
in procedure calls or if some demon in another node
needs their values. The variables which cause the
greatest problems are those we have named galactic
variables, those with values which must be
simultaneously available on several nodes.


Galactic variables provide communication 
between different main programs, and provide demons 
on remote nodes with fast access to certain 
variables. Global variable storage is attached to
each main program and is accessible by all demons
and procedures owned by that main program.
Galactic variables, on the other hand, are kept in 
a common storage area which is replicated on each 
node. There is only one instance of a galactic 
variable per node and access is done indirectly 
through pointers. This storage is common to all 
main programs, demons, and procedures existing on 
that node. When a program unit changes the value 
of a galactic variable that value is instantly 
available to all program units on the node. A 
broadcast message is also sent around the ring to 
update the value of that galactic variable in the 
galactic storage on all the other nodes. 
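
The replication scheme for galactic variables can be sketched in Python as follows. This is a simplified model under assumptions of our own: the class Ring and its methods write and step are hypothetical names, message delivery is modeled one hop at a time, and concurrent writers are not considered.

```python
from collections import deque


class Ring:
    """Each node holds its own copy of the galactic storage area."""

    def __init__(self, node_count):
        self.copies = [dict() for _ in range(node_count)]
        self.in_flight = deque()   # entries: (next_node, origin, name, value)

    def write(self, node, name, value):
        # The new value is visible at once on the writing node.
        self.copies[node][name] = value
        nxt = (node + 1) % len(self.copies)
        if nxt != node:
            self.in_flight.append((nxt, node, name, value))

    def step(self):
        """Move one update message one hop further around the ring."""
        if not self.in_flight:
            return False
        node, origin, name, value = self.in_flight.popleft()
        self.copies[node][name] = value
        nxt = (node + 1) % len(self.copies)
        if nxt != origin:          # stop once the message is back at its origin
            self.in_flight.append((nxt, origin, name, value))
        return True
```

Until the message has travelled the whole ring, copies on different nodes can hold different values for the same variable.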


Changes to galactic and global storage are
made through calls to the demon supervisor, a
collection of demon support routines resident on
each node, which buffer variable access on that
node. They alert appropriate demons when a key
variable is changed and initiate the propagation
of galactic variable change messages around the
ring. They also allow the buffering of variable
access between demons and a main program on a
different node. It is inevitable in this kind of
loosely coupled system that there will be times
when copies of the same galactic variable on
different nodes will have different values. The
system is designed to be robust enough to survive
in this situation.


AN ALGORITHM FOR THE CONCURRENT UPDATE
OF MULTIPLE-COPY DATABASES


Mohamed G. Gouda and Robert G. Arnold

Summary


The problem of concurrent updates for multiple- 
copy distributed databases has received much attention 
in recent years. A good survey of the problem and of 
some algorithms to solve it can be found in [1].
Although there are no quantitative studies to estimate 
the performance of different algorithms, most 
algorithms seem to require a large number of 
communications between the different sites in the 
distributed database system. Thus the system perfor- 
mance may be reduced greatly. 

To minimize the communication overhead, a recent 
algorithm [2] takes advantage of the locality of
reference which is inherent in some database systems.
The algorithm assumes that in 95% of the cases an up-
date request generated at some site will only access
data items local to that site. The algorithm allows
deadlocks to occur, and proposes a deadlock detection
scheme in which a global picture of the distributed
system is constructed and analyzed by one machine
called the "SNOOP". This seems to contradict the
distribution philosophy of a distributed system.

In this paper, we present an alternative algo-
rithm to solve the problem. Like [2], our algorithm
assumes locality of reference, but it does not cause
deadlocks; thus deadlock detection and resolution
(whether centralized or distributed) are not needed.

For convenience, assume that each site has a 
complete copy of the database. But instead of lock- 
ing and releasing all copies of a data item, we desig- 
nate one specific copy of the data item to be locked 
and released. The site which has the designated copy 
of a data item is called the primary site [1] of the 
data item. Different data items may have different 
primary sites. 

Each site s has an update process p which re-
ceives update requests from the site users and
executes them in cooperation with other update pro-
cesses at other sites. On receiving an update request
from a user at site s, p examines whether the request
is local or global. If the primary site of all the
data items in the request is s, then the request is
local; otherwise it is global. Because of locality
of reference, most update requests are local.
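
The local/global test is simple enough to state in a few lines of Python. The function name classify and the dictionary primary_site (mapping each data item to its primary site) are illustrative assumptions, not names from the paper.

```python
def classify(request_items, primary_site, site):
    """Return "local" if `site` is the primary site of every data item
    in the request, and "global" otherwise."""
    if all(primary_site[item] == site for item in request_items):
        return "local"
    return "global"
```

For example, with items a and b primary at site 1 and item c primary at site 2, a request for a and b issued at site 1 is local, while a request for a and c is global.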


Processing Local Updates:

To process a local update request u, the update
process p adds u to the wait queue at site s. u waits
in the wait queue until all its required data items
are released. u is then executed on the database
copy at s; then it is broadcast to all other sites
in the system where it is executed on other database
copies.




Processing Global Updates:

To process a global update request u, the update
process p sends u to a number of sites, namely, the
primary sites of the data items required by u. At
each one of these sites, u locks the required data
items in the site, then moves to the next site. After
locking all the required data items, u returns to site
s where it is executed on the database copy at s.
Then u is broadcast from s to all other sites to be
executed on the other database copies, and to release
the data items locked by u.

If global requests are allowed to travel between
sites in different orders, then deadlocks can occur.
To avoid deadlocks, define an arbitrary total order
< on the different sites in the system such that if
an update request u has to visit sites s and s' and if
s < s' then u must visit s before s'. This is a
sufficient condition for freedom from deadlocks.
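
The ordering rule can be illustrated by computing the route a global request takes. In this Python sketch the function name itinerary is hypothetical, sites are represented by integers, and the natural integer order stands in for the arbitrary total order < on sites.

```python
def itinerary(request_items, primary_site):
    """Primary sites a global update must visit, in ascending site order.

    Every request visits its sites in the same global order, so two
    requests can never each hold a lock at a site the other still has
    to visit earlier in the order; no waiting cycle can form.
    """
    return sorted({primary_site[item] for item in request_items})
```

Two requests that both need items at sites 1 and 3 will both visit site 1 first, so whichever locks there first proceeds and the other waits at site 1 without holding anything at site 3.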

In the paper, the algorithm is extended to
avoid starvation and to handle partially redundant
systems. Currently, we are working on the fault-
tolerance and performance analysis of the algorithm.


COMPLEXITY MEASURES OF COMPUTER STRUCTURES

Summary


The central theme of this paper is the measur- 
ing of the complexity of computer structures. We 
show that some commonly used computer classificati- 
on schemes are based on the notion of complexity 
measure. A special feature of such measures called 
structure preservation gives the computer architect 
a convenient tool for the considerations of machine 
flexibility and complexity, and for the comparison 
of computers and applications with regard to their 
structure. 


The great number of new computer designs based
on the technological advances of LSI makes it in-
creasingly difficult to recognize which concept is
suitable for a given application. Performance cri-
teria alone - such as the number of MOPS, where
the number of ALU's or processors is simply multi-
plied by the number of instructions executed per
second by each ALU or processor - are not suffici-
ent for this purpose. Only a part of the executed
instructions will actually be useful for the appli-
cation, while the remainder is required by the over-
head (cf. functional and procedural/movement instr-
uctions in [1], or execution and transformation in
[2]). The ratio of the overhead instructions to the
useful instructions (non functional ratio [1]) can
become great if the structure of the application
does not fit the computer structure.

The only structural feature of computers taken 
into account in the 1960’s was the machine word 
width. In order to make the parallelism of various 
computers more explicit, Flynn [3] introduced a cla-
ssification scheme that is still useful. But
one regards a computer that executes multiple inst-
ruction streams simultaneously to be "more parall-
el" than one executing a single instruction stream,
and similarly in the case of the multiple data 
streams. This is a quantitative aspect going beyond 
the mere classification. In accordance with other 
similar situations in mathematics and computer sci- 
ences, we could speak of a (rather simple) paralle- 
lism measure of computers. However, the amount of 
the quantified information is too small here, even 
if we additionally distinguish between the serial 
and parallel ALU operation [4]. Feng [5] considers 
another parallelism measure of computers, taking 
the word width and multiplying it by the number of 
such words that can be processed throughout the 
computer simultaneously. The measure makes no diff- 
erence between the various machine levels at which 
the parallelism is achieved. 


The scheme introduced in [4],[6] possesses this
last property. It uses the following characteris-
tics to compare computers:

k   the number of processors connected in parallel
k'  the number of phases in macro-pipelining
d   the number of ALU's connected in parallel
d'  the number of phases in instruction pipelining
w   the ALU wordwidth
w'  the number of phases in arithmetic pipelining.


[1] Flynn, M.J.

The suggestive notation (k×k', d×d', w×w') for the
corresponding 6-tuples shows at a glance the amount
of parallelism and pipelining of the computer at
the three main machine levels, e.g. (1×1, 64×1, 64×1)
for an Illiac IV quadrant, or (1×1, 1×10, 60×1) for
the main processor of the CDC 6600. The component-
wise ordering relation induces a partial ordering
over these 6-tuples that allows us to compare some
computers, or, on the other hand, that shows which
computers are incomparable with regard to their
structure. We speak generally of a complexity mea-
sure of computer structures, since not only paral-
lelism but also pipelining and the hierarchy of op-
eration are quantitatively characterized here.
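
The componentwise partial ordering on 6-tuples can be made concrete with a short Python sketch. The names Structure, leq, and compare are illustrative assumptions; the fields k1, d1, and w1 stand for k', d', and w'. The two example tuples are the ones quoted above.

```python
from typing import NamedTuple


class Structure(NamedTuple):
    k: int    # processors in parallel
    k1: int   # k': phases in macro-pipelining
    d: int    # ALU's in parallel
    d1: int   # d': phases in instruction pipelining
    w: int    # ALU wordwidth
    w1: int   # w': phases in arithmetic pipelining


def leq(a, b):
    """Componentwise ordering: a <= b iff every component of a is <= b's."""
    return all(x <= y for x, y in zip(a, b))


def compare(a, b):
    if a == b:
        return "equal"
    if leq(a, b):
        return "less"
    if leq(b, a):
        return "greater"
    return "incomparable"


ILLIAC_IV_QUADRANT = Structure(1, 1, 64, 1, 64, 1)
CDC_6600_MAIN = Structure(1, 1, 1, 10, 60, 1)
```

Under this ordering the Illiac IV quadrant and the CDC 6600 main processor are incomparable: the former has more parallel ALU's (64 against 1), the latter more instruction pipelining phases (10 against 1).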


Certain complexity measures possess the special
property that their target partially ordered set
can be equipped with operations that correspond to
operations on the source set, i.e. computers or ap-
plications in our case. Thus the expression
(10,1,12) × (1,1×10,60) - for the sake of simplici-
ty, we omit the component "×1" - characterizes the
complete CDC 6600 with attached peripheral proces-
sors. The operator "v" ("alternative") can be used
in order to show the different possible modes or
structures of the computer. We obtain the follow-
ing expression for the pilot EGPA configuration
(cf. a companion paper [7]):
[(1,1,32) v (1,32,1)] × [(4,1,32) v (1×4,1,32) v (1,4,32) v (1,128,1)].


Structure preserving measures have already
proved to be useful for the comparison of computers
[4],[6]. They could be even more useful in
the search for optimal embeddings of application
structures into a fixed computer structure [7]. The
establishing of formal tools for this purpose is
one of the most ambitious aims of our present work.


PARALLEL MEMORY SYSTEM FOR A PARTITIONABLE SIMD/MIMD MACHINE

Abstract -- PASM is a large-scale partition-
able SIMD/MIMD multimicroprocessor system being
designed for image processing tasks. To improve
machine throughput, a memory management system
employing parallel secondary storage devices and
double-buffered primary memories has been de-
vised. The memory system is an intelligent one,
using communicating microprocessors which are
dedicated to handling data requests and file
management. The memory system bus structure is
organized to exploit much parallelism in
transferring data from the secondary memories to
the primary memories of virtual SIMD and MIMD
machines.

I. Introduction

As a result of the microprocessor revolu-
tion, it is now feasible to build a dynamically
reconfigurable large-scale multimicroprocessor
system capable of performing image processing
tasks more rapidly than previously possible.
There are several ways to harness the parallel
processing power of a multimicroprocessor system:
SIMD, MSIMD, MIMD, and PSM.

An SIMD (single instruction stream -
multiple data stream) machine [5] typically con-
sists of a set of N processors, N memories, an
interconnection network, and a control unit (e.g.
Illiac IV [2]). The control unit broadcasts in-
structions to the processors and all active
("turned on") processors execute the same in-
struction at the same time. Each processor exe-
cutes instructions using data taken from a memory
to which only it is connected. The interconnec-
tion network allows interprocessor communication.

An MSIMD (multiple-SIMD) system is a parallel
processing system which can be structured as two
or more independent SIMD machines (e.g. MAP
[16]). An MIMD (multiple instruction stream -
multiple data stream) machine [5] typically con-
sists of N processors and N memories, where each
processor may follow an independent instruction
stream (e.g. C.mmp [38]). As with SIMD architec-
tures, there is a multiple data stream and an in-
terconnection network. A PSM (partitionable
SIMD/MIMD) system [22] is a parallel processing
system which can be structured as two or more in-
dependent SIMD and/or MIMD machines (e.g. PASM
[28]).

PASM, a particular PSM-type system for image
processing and pattern recognition, is currently
being designed at Purdue University [22]. Due to
the low cost of microprocessors, computer system
designers have been considering various multimi-
croprocessor architectures [e.g. 3, 9, 12, 13,
17, 34, 36]. The system described here was the
first in the literature to combine the following
two features:

1) it may be partitioned to operate as many in-
dependent SIMD and/or MIMD machines of varying
sizes; and

2) a variety of problems in image processing and
pattern recognition are being used to guide the
design choices.

In the next section, a brief overview of
PASM is presented. The sections following
describe various aspects of the PASM memory sys-
tem. The use of parallel secondary storage dev-
ices, double-buffered primary memories, and dedi-
cated microprocessors for memory management are
discussed.

II. PASM Overview 


PASM, a partitionable SIMD/MIMD system [22,
28, 29], is a dynamically reconfigurable multimi-
croprocessor machine for image processing. It is
a parallel processing system which can be struc-
tured as one or more independent SIMD and/or MIMD
machines of varying sizes. A block diagram of
PASM is shown in Figure 1.

Figure 1: Block diagram overview of PASM.

Figure 2: The Parallel Computation Unit.

The heart of the system is the Parallel
Computation Unit, which contains N processors, N
memory modules, and the interconnection network.
The Parallel Computation Unit processors are mi- 
croprocessors that perform the actual SIMD and 
MIMD computations. The Parallel Computation Unit 
memory modules are used by the Parallel Computa- 
tion Unit processors for data storage in SIMD 
mode and both data and instruction storage in 
MIMD mode. The interconnection network provides 
a means of communication among the Parallel Com- 
putation Unit processors and memory modules. 

The Micro Controllers are a set of micropro-
cessors which act as the control units for the
Parallel Computation Unit processors in SIMD mode
and orchestrate the activities of the Parallel
Computation Unit processors in MIMD mode.
Control Storage contains the programs for the Mi-
cro Controllers. The Memory Management System
controls the loading and unloading of the Paral-
lel Computation Unit memory modules from the
Memory Storage System. The System Control Unit
is a conventional machine, such as a PDP-11, and
is responsible for the overall coordination of
the activities of the other components of PASM.

The Parallel Computation Unit is organized
as shown in Figure 2. A pair of memory units is
used for each Parallel Computation Unit memory
module so that data can be moved between one
memory unit and secondary storage while the
Parallel Computation Unit processor operates on
data in the other memory unit. The Parallel Com-
putation Unit processors, which are physically
numbered (addressed) from 0 to N-1, where N = 2^n,
communicate through the interconnection network.
The interconnection network being considered is a
variation of the data manipulator [4], a multi-
stage implementation of the "PM2I" network [20,
21, 24, 32], called the Augmented Data
Manipulator (ADM) [30]. Other possibilities are
cube and shuffle-exchange type networks [10, 11,
17]. Any of these interconnection networks can
be partitioned into independent sub-networks of
varying sizes, which are powers of two, if the
physical addresses of the 2^p processors and
memory modules in a partition have the same n-p
low-order bits [30, 31].

Figure 3: Organization of the Micro Controllers
(MCs).
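
The partitioning condition (a partition of 2^p processors must agree in the n-p low-order bits of their physical addresses) can be checked mechanically. The Python function below is a sketch; the name is_valid_partition is an assumption made for illustration.

```python
def is_valid_partition(addresses, n):
    """Check whether `addresses` (physical numbers in 0 .. 2**n - 1) can
    form a partition: the size must be a power of two, 2**p, and all
    members must share the same n-p low-order address bits."""
    size = len(addresses)
    if size == 0 or size & (size - 1):          # not a power of two
        return False
    if len(set(addresses)) != size:             # duplicates
        return False
    if any(a < 0 or a >= (1 << n) for a in addresses):
        return False
    p = size.bit_length() - 1
    mask = (1 << (n - p)) - 1                   # selects the n-p low-order bits
    return len({a & mask for a in addresses}) == 1
```

With n = 5 (N = 32), the eight processors 0, 4, 8, ..., 28 share their two low-order bits and form a valid partition, whereas processors 0, 1, 2, 3 do not.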
The method used to provide multiple control
units is shown in Figure 3. There are Q = 2^q Micro
Controllers, physically addressed (numbered) from
0 to Q-1. Micro Controller i controls the N/Q
Parallel Computation Unit processors whose low-
order q physical address bits equal i. Each Mi-
cro Controller has a memory module which contains
a pair of memory units, allowing memory loading
and unloading and computations to be overlapped.

A virtual SIMD machine of size RN/Q, where
R = 2^r and 0 ≤ r ≤ q, is obtained by loading R Mi-
cro Controller memory modules with the same in-
structions simultaneously. Similarly, a virtual
MIMD machine of size RN/Q is obtained by combin-
ing the efforts of the Parallel Computation Unit
processors of R Micro Controllers. For either
SIMD or MIMD mode, the physical addresses of
these R Micro Controllers must have the same
low-order q-r bits since the physical addresses
of all Parallel Computation Unit processors in a
partition must agree in their low-order bit posi-
tions in order for the interconnection network to
function properly.

Given a virtual machine of size RN/Q, the
Parallel Computation Unit processors and memory
modules for this partition have logical addresses
(numbers) 0 to (RN/Q)-1, R = 2^r, 0 ≤ r ≤ q.
Assuming that the Micro Controllers have been as-
signed as described above, then the logical num-
ber of a Parallel Computation Unit processor or
memory module is the high-order r+n-q bits of the
physical number. Recall that all of the physical
addresses of the processors in a partition must
have the same q-r low-order bits. For example,
for N = 1024, Q = 16, and R = 4, one allowable
choice of Parallel Computation Unit processors to
form a partition of size RN/Q is those whose phy-
sical addresses are 3, 7, 11, 15, ..., 1023. The
high-order r+n-q = 8 bits of these 10-bit physi-
cal addresses are 0, 1, 2, 3, ..., 255, respective-
ly. The value of the low-order q-r = 2 bits of
all the physical processor addresses is equal to
three.

Similarly, the Micro Controllers assigned to
the partition are logically numbered (addressed)
from 0 to R-1. For R > 1, the logical number of
a Micro Controller is the high-order r bits of
its physical number. Recall all of the physical
addresses of the Micro Controllers in a partition
must agree in the low-order q-r bits. For R = 1,
there is only one Micro Controller and it is con-
sidered logical number 0. For example, if N =
1024, Q = 16, and R = 4, one allowable choice of
four Micro Controllers is those whose physical
addresses are 3, 7, 11, and 15. The high-order r
= 2 bits of these four-bit physical addresses are
0, 1, 2, and 3, respectively. The value of the
low-order q-r = 2 bits of all the physical Micro
Controller addresses is equal to three.
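
Both worked examples above reduce to dropping the q-r low-order bits of a physical number. The Python sketch below reproduces them; the function names partition_members and logical_number are illustrative assumptions.

```python
def partition_members(n, q, r, low):
    """Physical processor addresses (n bits) of the partition whose
    q-r low-order address bits all equal `low`."""
    shift = q - r
    return [(high << shift) | low for high in range(1 << (r + n - q))]


def logical_number(physical, q, r):
    """Logical number = the physical number with its q-r low-order bits
    removed. For a processor this leaves the high-order r+n-q bits; for
    a Micro Controller (a q-bit number) it leaves the high-order r bits."""
    return physical >> (q - r)
```

For N = 1024 (n = 10), Q = 16 (q = 4), and R = 4 (r = 2), partition_members(10, 4, 2, 3) yields 3, 7, 11, ..., 1023, and logical_number maps these to 0, 1, 2, ..., 255; the Micro Controllers 3, 7, 11, 15 map to 0, 1, 2, 3.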

This brief overview of PASM is provided as
background for the following sections. More de-
tails about PASM and partitionable interconnec-
tion networks can be found in [22-31].

The Memory Management System in PASM will
have its own intelligence and will use the paral-
lel secondary storage devices of the Memory
Storage System. As guidelines for design pur-
poses, it is assumed that N and Q are at least
1024 and 16, respectively. (Systems with 2^14 to
2^16 microprocessors have been proposed [17, 36].)
Giving the Memory Management System its own in-
telligence will help prevent the System Control
Unit from being overburdened. The parallel
secondary storage devices will allow fast loading
and unloading of the N double-buffered Parallel
Computation Unit memory modules and will provide
storage for system image and picture data and
MIMD programs. The Memory Management System and
Memory Storage System are described further in
the following sections.


III. PASM Memory Storage System


Secondary storage for PASM's Parallel Compu-
tation Unit memory modules is provided by the
Memory Storage System. The Memory Storage System
will consist of N/Q independent Memory Storage
units, where N is the number of Parallel Computa-
tion Unit memory modules and Q is the number of
Micro Controllers in PASM. The Memory Storage
units will be numbered from 0 to (N/Q)-1. Each
Memory Storage unit is connected to Q Parallel
Computation Unit memory modules. For 0 ≤ i < N/Q,
Memory Storage unit i is connected to those
Parallel Computation Unit memory modules whose
physical addresses are of the form:

(Q * i) + k, 0 ≤ k < Q.

Recall that, for 0 ≤ k < Q, Micro Controller k is
connected to those Parallel Computation Unit pro-
cessors whose physical addresses are of the form:

(Q * i) + k, 0 ≤ i < N/Q.

Thus, Memory Storage unit i is connected to the
ith Parallel Computation Unit processor/memory
module pair of each Micro Controller. This is
shown for N = 32 and Q = 4 in Figure 4.

Figure 4: Organization of the Memory Storage
System for N = 32 and Q = 4. MSSU is Memory
Storage System unit. MC is Micro Controller.
PCU PE is Parallel Computation Unit Processing
Element (processor - memory module pair).

The two main advantages of this approach for
a partition of size N/Q are that (1) all of the
Parallel Computation Unit memory modules can be
loaded in parallel and (2) the data is directly
available no matter which partition (Micro Con-
troller group) is chosen. This is done by stor-
ing the data for a task which is to be loaded
into the ith Parallel Computation Unit memory
module of the virtual machine of size N/Q in
Memory Storage unit i, 0 ≤ i < N/Q. Memory
Storage unit i is connected to the ith Parallel
Computation Unit memory module in each Micro Con-
troller group (i.e., Parallel Computation Unit
memory modules Q * i, (Q * i) + 1, (Q * i) +
2, ...). Thus, no matter which Micro Controller
group of N/Q Parallel Computation Unit processors
is chosen, the data from the ith Memory Storage
unit can be loaded into the ith Parallel Computa-
tion Unit memory module of the virtual machine,
for all i, 0 ≤ i < N/Q, simultaneously.
For example, in Figure 4, if the partition
of size N/Q = 8 chosen consists of the Parallel
Computation Unit processors connected to Micro
Controller 2, then Memory Storage unit 0 would
load Parallel Computation Unit memory module 2, 1
would load 6, 2 would load 10, etc. If instead
Micro Controller 3's Parallel Computation Unit
processors were chosen, Memory Storage unit 0
would load Parallel Computation Unit memory
module 3, 1 would load 7, 2 would load 11, etc.
Thus, for virtual machines of size N/Q, this
secondary storage scheme allows all N/Q Parallel
Computation Unit memory modules to be loaded in
one parallel block transfer. This same approach
can be taken if only (N/Q)/2^d distinct Memory
Storage System units are available, where
0 ≤ d ≤ n-q. In this case, however, 2^d parallel
block loads would be required instead of just
one.

Consider the situation where a virtual
machine of size RN/Q is desired, 1 ≤ R ≤ Q, and
there are N/Q Memory Storage System units. In
general, a task needing RN/Q Parallel Computation
Unit processors, logically numbered 0 to RN/Q-1,
would require R parallel block loads if the data
for the Parallel Computation Unit memory module
whose high-order n-q logical address bits equal i
is loaded into Memory Storage unit i. This is
true no matter which group of R Micro Controllers
(which agree in their low-order q-r address bits)
is chosen.


For example, consider Figure 4, where N = 32
and Q = 4. Assume that a virtual machine of size
16 is desired. The data for the Parallel Compu-
tation Unit memory modules whose logical ad-
dresses are 0 and 1 is loaded into Memory Storage
unit 0, for memory modules 2 and 3 into unit 1,
for memory modules 4 and 5 into unit 2, etc. As-
sume the partition of size 16 is chosen to con-
sist of the Parallel Computation Unit processors
connected to Micro Controllers 0 and 2 (i.e., all
even physically numbered processors). Then the
Memory Storage System units first load Parallel
Computation Unit memory modules physically ad-
dressed 0, 4, 8, 12, 16, 20, 24, and 28 (simul-
taneously), and then load memory modules 2, 6,
10, 14, 18, 22, 26, and 30 (simultaneously). As
explained in section II, given this assignment of
Micro Controllers, the Parallel Computation
Unit memory module whose physical address is 2 *
i has logical address i, 0 ≤ i < 16. Assume instead
that the Parallel Computation Unit processors and
memory modules associated with Micro Controllers 1
and 3 are chosen. First memory modules physically ad-
dressed 1, 5, 9, 13, 17, 21, 25, and 29 are load-
ed simultaneously, and then modules 3, 7, 11, 15,
19, 23, 27, and 31 are loaded simultaneously. In
this case, the Parallel Computation Unit memory
module whose physical address is (2 * i) + 1 has
logical address i, 0 ≤ i < 16. No matter which
pair of Micro Controllers is chosen, only two
parallel block loads are needed.
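
The load sequences in these examples follow directly from the connection rule (Memory Storage unit i serves physical modules Q * i + k). The Python sketch below is illustrative only: the function names block_loads and storage_unit are assumptions, and the loads are taken in ascending Micro Controller order, as in the examples.

```python
def block_loads(N, Q, controllers):
    """One parallel block load per chosen Micro Controller k. Within a
    load, Memory Storage unit i loads physical module Q*i + k, so the
    inner list is indexed by Memory Storage unit number."""
    return [[Q * i + k for i in range(N // Q)] for k in sorted(controllers)]


def storage_unit(logical_address, r):
    """Memory Storage unit holding the data for a logical module address:
    the high-order n-q bits of the (r+n-q)-bit logical address."""
    return logical_address >> r
```

For N = 32 and Q = 4, block_loads(32, 4, [0, 2]) gives the two loads 0, 4, ..., 28 and 2, 6, ..., 30, and storage_unit with r = 1 places logical modules 0 and 1 in unit 0, modules 2 and 3 in unit 1, and so on.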

Thus, for a virtual machine of size RN/Q,
this secondary storage scheme allows all RN/Q
Parallel Computation Unit memory modules to be
loaded in R parallel block transfers, 1 ≤ R ≤ Q.
As stated above for the special case where R = 1,
the same approach can be taken for R > 1 if only
(N/Q)/2^d distinct Memory Storage System units are
available. In this situation, however, R * 2^d
parallel block loads would be required instead of
just R.

The actual devices that will be used as
Memory Storage System units will depend upon the
speed requirements of the rest of PASM, cost con-
straints, and the state of the art of storage
technology at implementation time. Possibilities
to be investigated include disks, bubble
memories, and CCD's.


The PASM Memory Management System makes use
of the double-buffered arrangement of the Paral-
lel Computation Unit memory modules to enhance
system throughput. The scheduler, using informa-
tion from the System Control Unit such as number
of Parallel Computation Unit processors needed
and maximum allowable run time, will sequence
tasks waiting to execute [29]. Typically, all of
the data for a task will be loaded into the ap-
propriate Parallel Computation Unit memory units
before execution begins. Then, while a Parallel
Computation Unit processor is using one of its
memory units, the Memory Management System can be
loading the other unit for the next task. When
the task currently executing completes, the
Parallel Computation Unit processor can switch to
its other memory unit for doing the next task.
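
The double-buffering discipline can be sketched sequentially in Python. This is a toy model: the names MemoryModule, load_inactive, switch, and run_tasks are assumptions, tasks are opaque values, and the overlap of loading with computation that real hardware provides is only indicated by a comment.

```python
class MemoryModule:
    """A memory module made of two memory units; the processor is
    connected to one unit while the other can be loaded."""

    def __init__(self):
        self.units = [None, None]
        self.active = 0                      # unit the processor is using

    def load_inactive(self, data):
        self.units[1 - self.active] = data   # done by the Memory Management System

    def switch(self):
        self.active = 1 - self.active
        return self.units[self.active]


def run_tasks(tasks):
    """Process a non-empty list of tasks in order, preloading each next task."""
    module = MemoryModule()
    module.units[module.active] = tasks[0]   # first task loaded before execution
    processed = []
    for next_task in tasks[1:] + [None]:
        module.load_inactive(next_task)      # overlaps with execution in hardware
        processed.append(module.units[module.active])
        module.switch()
    return processed
```

The processor never waits for a load in this model because the next task is always resident in the other unit by the time the current one completes.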

Based on image processing and pattern recog- 
nition tasks which have been examined, the fol- 
lowing conclusion has been reached. Due to the 
use of double-buffering, the potentially large 
Parallel Computation Unit memory modules, and the 
special purpose design of PASM, the time sharing
of the Parallel Computation Unit processors and
the use of conventional paging are not desirable.

There may be some cases where all of the
data will not fit into the Parallel Computation
Unit memory space allocated. Assume a memory
frame is the amount of space used in a Parallel
Computation Unit memory unit for the storage of
data from secondary storage for a particular
task. There are tasks where many memory frames
are to be processed by the same program (e.g.,
maximum likelihood classification of satellite
data [35]). The double-buffered Parallel Compu-
tation Unit memory modules can be used so that as
soon as the data in one memory unit is processed,
the Parallel Computation Unit processor can
switch to the other unit and continue executing
the same program. When the Parallel Computation
Unit processor is ready to switch memory units,
it signals the Memory Management System that it
has finished using the data in the memory unit to
which it is currently connected. Hardware to
provide this signaling capability can be provided
in different ways, such as using interrupt lines
from the Parallel Computation Unit processors or
by using logic to check the address lines between
the Parallel Computation Unit processor and its
memory modules for a special address code. After
the appropriate tests to ensure that the new
memory frame is available [29], the processor
switches memory units. The Memory Management
System can then load the "finished" memory unit
with the next memory frame or next task. Such a
scheme, however, requires some mechanism which
can move variable length portions of programs or
data sets (i.e., local data) stored in one unit
of a memory module to the other unit when the
associated processor switches to access the next
memory frame.

Three hardware methods are considered for
implementing local variable storage. Each would
be used only when multiple memory frames are to
be processed. The first method consists of a
separate local memory allocated to each Parallel
Computation Unit processor for the purpose of
storing local variables. This local memory would
be in addition to the processor's memory module.
Such a local memory would not be affected by the
changing connections of memory units associated
with it. The second method would consist of
splitting the local variable storage, and using a
variable length portion of each memory unit as
local variable storage. This scheme would require
w/2 words of storage in each memory unit to
implement w words of local variable storage.
This specially allocated space in the memory
units would be protected by hardware when the
associated Parallel Computation Unit processor
changes memory units. The third method stores
local variables in the memory units in much the
same way as method two, but in this case w words
are required in each Parallel Computation Unit
memory unit for w words of local variable
storage. This scheme preserves local variable
storage by maintaining a current copy of the
local variables in both memory units associated
with a given Parallel Computation Unit processor.

Of the three methods described above, method
one is the least flexible since it requires a
fixed amount of memory to be dedicated to local
variable storage at all times. This method may
tend to utilize inefficiently the special local
variable memories it requires since these
memories will have to be large enough to handle
tasks which may require amounts of local variable
storage many times greater than that of a typical
job. For example, a task may require that a
portion of a reference image be stored within the
local variable storage space. Such a task might
be executed infrequently but would require a
relatively large amount of local variable storage
space. Other tasks run by PASM might be executed
far more frequently but would require far less
local variable storage space than the reference
image example above. The result is that while
the tasks requiring a small portion of the
available local variable storage space are being
run, the bulk of the available local variable
storage space is not utilized. Furthermore, if a
task requires more local variable storage than
expected (i.e., more local variable storage than
the fixed size dedicated memory has space for), a
problem arises which will require additional
hardware and/or software overhead to solve.


The second method described above makes the
most efficient use of the memory space available
for local variable storage in that for w words of
local variable storage required, only w words of
actual memory space are used. Since w would be
variable, only the amount of local variable
storage space required by a given task would need
to be allocated to the task. This method, however,
has several inherent disadvantages. First,
when a Parallel Computation Unit processor
executing a given task begins processing the last
memory frame associated with the task, the Memory
Management System will normally load the inactive
memory unit with data for the next task to be
run. If the local variable storage system is in
use, however, the next task cannot be loaded into
the memory unit since the w/2 words of local
variable storage in the inactive memory unit must
be preserved until the current task is complete.
A second disadvantage of this method is that the
Parallel Computation Unit processor addresses
which access local variable storage stored in the
inactive memory unit must be translated to
properly address the local variable storage in this
memory unit. Such address translation is likely
to require additional hardware and may cause
additional delay in address decoding.

The third method described above would make
less than optimal use of the space allocated for
local variable storage in the Parallel Computation
Unit memory module (2w words of the memory
module are needed for w words of local variable
storage), but it does not require the address
translation of method two and provides much more
flexibility than method one. It also eliminates
the problems encountered in method two when a new
task is loaded into a memory unit which contains
local variable storage associated with a previous
task. This method maintains a copy of local
variables in both memory units associated with a
given Parallel Computation Unit processor so that
switching memory units does not alter the local
variable storage associated with the processor.
The implementation of variable size local vari- 
able storage for this method is simpler and more 
straightforward than that of method two above 
since the total address space for a single pro- 
cessor is fixed at the size of a single memory 
unit. In method two, the total address space 
would be the fixed size of a memory unit plus 
w/2. More image processing and pattern recogni- 
tion algorithms suitable for implementation on 
PASM need to be studied to determine if the effi- 
ciency gained by optimal utilization of memory 
space in method two will be significant enough to 
offset the problems associated with this method. 
Currently, method three appears to be the most 
promising. 
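
The storage cost of the three methods can be
compared with a small Python sketch. The function
name and the parameter dedicated_size (the fixed
size of the separate local memory in method one)
are hypothetical; the per-method word counts follow
the description above.

```python
def local_storage_words(method, w, dedicated_size=None):
    """Words of physical storage consumed for w words of local variables.

    method 1: separate fixed-size local memory (dedicated_size words)
    method 2: w/2 words in each of the two memory units
    method 3: a full copy (w words) in each of the two memory units
    """
    if method == 1:
        if dedicated_size is None:
            raise ValueError("method 1 needs the dedicated memory size")
        if w > dedicated_size:
            # The case the text flags as needing extra hardware/software.
            raise ValueError("task exceeds the dedicated local memory")
        return dedicated_size       # used in full regardless of w
    if method == 2:
        return w                    # w/2 + w/2
    if method == 3:
        return 2 * w                # one copy per memory unit
    raise ValueError("method must be 1, 2, or 3")
```

For a task needing 100 words, method two uses 100
words and method three 200, while method one ties up
the whole dedicated memory whatever the task needs.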

One possible hardware arrangement to imple- 
ment method three is described below. The ar- 
rangement makes use of two characteristics of the 
PASM memory access requirements: 

1) secondary memory will not be able to load a 
given memory unit at the maximum rate it can ac- 
cept data, and 

2) Parallel Computation Unit processors will not
often be able (or desire) to write to memory on
successive memory cycles.

Because of these two characteristics, Parallel
Computation Unit processor stores to local
variable storage locations in an active memory unit
can be trapped by a bus interface register and
stored in the inactive memory unit by stealing a
cycle on the secondary memory bus. In essence,
this technique makes use of the conventional
store-through concept as described in [7, 14].

An exception to the second characteristic
mentioned above is multiple precision data. If
16 bit words are assumed, then for higher
precision it may be desirable to use two or four
words as a group. However, a simple buffering
scheme can handle this possibility.

The amount of memory allocated as local
storage is determined by the contents of a k-bit
base register. This register may be altered by
the Memory Management System. If 2^p locations
are available in each memory unit for Parallel
Computation Unit processor use, local storage can
be allocated in blocks of 2^(p-k) words. B blocks,
1 ≤ B ≤ 2^k, can be allocated for local storage by
storing B in a base register. This has the effect
of allocating all memory locations from 0 to
B * 2^(p-k) - 1 as local storage. When a processor
writes to a local variable location, a k-bit block
comparator causes the memory address and data being
written to be trapped by a bus interface register.
A cycle request flip-flop is set to indicate
to the logic which controls the buses associated
with the Parallel Computation Unit memory
module that a cycle is needed on the secondary
memory bus. When the cycle is granted, the
flip-flop is reset and the data in the bus
interface register is gated into the inactive
memory unit. In this way, the space allocated
for local variable storage remains updated in
both memory units at all times. It is assumed
that the bus interface register will have maximum
priority for secondary memory bus usage since
this would prevent the processors in the Parallel
Computation Unit from having to wait to write to
a location designated for use as local storage.
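
The store-through arrangement can be sketched in
Python as follows. This is a minimal software model
under stated assumptions: local storage is taken to
occupy the lowest B blocks of 2^(p-k) words each, the
block comparator is modeled as a comparison of the
top k address bits with B, and a pending trapped
write is flushed before the next one is trapped
(standing in for the maximum bus priority assumed
above). All class and method names are hypothetical.

```python
class MemoryModule:
    """Two memory units with store-through of local variables (model)."""

    def __init__(self, p, k, blocks):
        self.p, self.k = p, k
        self.base = blocks                  # contents of base register B
        size = 1 << p                       # 2^p locations per memory unit
        self.units = [[0] * size, [0] * size]
        self.active = 0                     # unit connected to the processor
        self.trap = None                    # bus interface register
        self.cycle_request = False          # cycle request flip-flop

    def is_local(self, addr):
        # Block comparator: top k address bits select the block number.
        return (addr >> (self.p - self.k)) < self.base

    def write(self, addr, value):
        self.units[self.active][addr] = value
        if self.is_local(addr):
            if self.cycle_request:          # assumed: earlier write goes first
                self.grant_cycle()
            self.trap = (addr, value)       # trap address and data
            self.cycle_request = True

    def grant_cycle(self):
        # Steal one cycle on the secondary memory bus.
        if self.cycle_request:
            addr, value = self.trap
            self.units[1 - self.active][addr] = value
            self.cycle_request = False
```

With p = 4, k = 2, and B = 1, addresses 0 to 3 are
local: a write to address 2 is copied to the inactive
unit on the next granted cycle, while a write to
address 5 stays in the active unit only.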


The method described above is applicable to 
any system which allows its processing tasks to 
utilize several separate memories and which re- 
quires that identical copies of variable amounts 
of certain data be maintained in all memories so 
used. 


V. Altering Loading Sequences 


To further increase the flexibility of PASM, 
a task may alter the sequence of data processed 
by it during execution. As an example, consider 
a task which is attempting to identify certain 
features within a series of images. The task 
might examine a visible spectrum copy of an image 
and, based on features identified within the im- 
age, choose to examine an infrared spectrum copy 
of the same image. Rather than burden the System 
Control Unit to perform data loading sequence al- 
terations, the task is allowed to communicate 
directly with the Memory Management System. 

In the case of an SIMD task, the associated
Micro Controller(s) determines if changes are
required in the data loading sequence for the task.
If so, a Micro Controller specifies the nature of
the changes and communicates them to the Memory
Management System without involving the System
Control Unit. Each Micro Controller in the PASM
system has the capability to generate loading
sequence changes. For tasks which require R Micro
Controllers (1 ≤ R ≤ Q), logically numbered 0 to
R-1, control instructions exist so that logical
Micro Controller 0 will handle loading sequence
changes. Micro Controller 0 uses logical Parallel
Computation Unit processor number 0 of the
virtual machine to establish a control information
list in logical Parallel Computation Unit
memory module 0. (There are Q Parallel Computation
Unit processors which can possibly be logically
numbered 0 in a virtual machine. They are
those Parallel Computation Unit processors which
are physically numbered 0, 1, 2, ..., Q-1.) This
list specifies in a concise fashion the loading
sequence alterations required and includes
information such as the IDs of the data files to be
loaded, the Parallel Computation Unit memory
modules which are to receive the data, and the
locations within the Parallel Computation Unit
memory modules where the data is to be loaded.
The Micro Controller initiates the transfer of
this list to the Memory Management System by
using logical Parallel Computation Unit processor 0
to write a pointer to the list into the highest
addressable memory location of its memory module.
Through the use of a simple address comparison,
the write into this memory location generates an
interrupt to the Memory Management System. The
Memory Management System recognizes the interrupt
as a request for a loading sequence change and
determines which Micro Controller is making the
request. The Memory Management System uses the
list of control information (via the pointer
provided) to determine the loading sequence changes
required.

An alternative method of interrupt generation
is to use an interrupt line from each of the
Q possible logical Parallel Computation Unit
processor 0's to the Memory Management System. The
method selected for interrupt generation will
depend upon the interrupt capabilities of the
microprocessor used in the Parallel Computation
Unit. While loading sequence control information
could be passed directly from the Micro Controllers
to the Memory Management System, the length
of the connections required may make
implementation more difficult and costly.

One hardware scheme which can transfer the
control information list from a Parallel Computation
Unit memory module to the Memory Management
System is shown in Figure 5. The hardware system
shown is based on having the Memory Management
System coordinate the recognition of Micro
Controller interrupts and the associated transfers
of control information lists from the Parallel
Computation Unit memory modules. The interrupt
recognition portion is handled by the Parallel
Computation Unit processor Interrupt Control Logic
while the transfer of control information lists
is handled by the Parallel Computation Unit
memory module Access Control Logic.

Consider the following example in a virtual
machine whose processor logically numbered 0 is
physically numbered i. Suppose processor i
establishes a control information list in one of
its memory units and writes a pointer to the list
into its corresponding interrupt generation
location. The memory write to the interrupt
generation location is signaled to the Interrupt
Control by a pulse on the Interrupt Request line
corresponding to processor i. This pulse causes
the Interrupt Control to signal the Memory
Management System that processor i has generated
an interrupt to the Memory Management System.
The Memory Management System then uses the Access
Control to read the interrupt generation location
in the Parallel Computation Unit memory module to
obtain the pointer to the control information
list. The control information list is then read
from the Parallel Computation Unit memory module
by the Memory Management System. Finally, the
Memory Management System signals the Interrupt
Control to generate a pulse on the Interrupt
Accepted line to processor i.

Figure 5: Hardware scheme for dynamically
altering the loading sequence of
the memory modules. (Blocks shown: Processor
Interrupt Control Logic, Memory Module Access
Control Logic, Memory Management System. Signals
shown: Interrupt Request, Interrupt Accept,
Select, Address, Data.)
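
The request sequence in this example can be sketched
in Python. The model below is a simplification: the
layout of the control information list (a length
word followed by the entries) is an assumption made
for illustration, since the paper does not specify
it, and all names are hypothetical.

```python
class ParallelModule:
    """Memory module whose highest address raises an interrupt (model)."""

    def __init__(self, size):
        self.mem = [0] * size
        self.interrupt_location = size - 1   # highest addressable location
        self.interrupt_request = False       # Interrupt Request line
        self.interrupt_accepted = False      # Interrupt Accepted line

    def write(self, addr, value):
        self.mem[addr] = value
        if addr == self.interrupt_location:  # simple address comparison
            self.interrupt_request = True


def request_change(module, list_addr, control_list):
    """Processor side: store the list, then write the pointer to it."""
    module.write(list_addr, len(control_list))      # assumed length prefix
    for offset, item in enumerate(control_list, start=1):
        module.write(list_addr + offset, item)
    module.write(module.interrupt_location, list_addr)


def service_interrupt(module):
    """Memory Management System side: read pointer, then the list."""
    if not module.interrupt_request:
        return None
    pointer = module.mem[module.interrupt_location]
    length = module.mem[pointer]
    control_list = module.mem[pointer + 1:pointer + 1 + length]
    module.interrupt_request = False
    module.interrupt_accepted = True                # pulse back to processor
    return control_list
```

A caller would invoke request_change on the processor
side and service_interrupt on the Memory Management
System side; the list must not overlap the interrupt
generation location.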

The same hardware arrangement described for
SIMD tasks is used for MIMD tasks. With each
group of N/Q MIMD processors, there is associated
a memory supervisor which is logical processor 0
within the group. The memory supervisor
possesses the hardware for Memory Management
System interrupt generation and loading sequence
alterations using the same arrangement described
for SIMD mode. All processors associated with a
given memory supervisor make requests for loading
sequence changes through the memory supervisor,
without involving the Micro Controllers or System
Control Unit. This reduces System Control Unit
contention problems, as mentioned above, and
helps prevent the Micro Controller(s)
orchestrating the virtual MIMD machine from becoming
overburdened.

The scheme described here is well suited to
parallel computer systems which execute multiple
parallel tasks since it can easily keep track of
and arbitrate multiple requests for data loading
sequence alterations. This technique also makes
efficient use of the multiple memory arrangement
of PASM by using the hardware structure of the
memory system to provide for communication of
loading sequence alteration information from the
memory system to the controller which loads data
into the memory system.


VI. PASM Memory Management System


Tasks for which the Memory Management System
is responsible include file system maintenance,
scheduling of Parallel Computation Unit memory
module loading and unloading, and Memory Storage
System bus control. A set of microprocessors is
dedicated to performing the Memory Management
System tasks in a distributed fashion, i.e., one
processor will handle Memory Storage System bus
control, one will handle the scheduling tasks,
etc. This distributed processing approach is
chosen in order to provide the Memory Management
System with a large amount of processing power at
low cost. In addition, dedicating specific
microprocessors to certain tasks simplifies both
the hardware and software required to perform
each task.

Figure 6: Distributed Memory Management System.


The basic architecture of the Memory Manage- 
ment System is shown in Figure 6. The Memory 
Management System consists of a master processor 
which coordinates the concurrent tasks executed 
by the slave processors, a shared memory for 
storage of data required by more than one proces- 
sor, a local ROM and RAM for each processor for 
storage of code and local data respectively, and 
an interface to the shared memory for each pro- 
cessor. 

A shared memory approach is used to allow
the processors to communicate with each other and
to share data. This approach is planned due to
the need to share relatively large quantities of
data such as file tables and task queues. As an
example, consider a queue of Memory Storage
System to Parallel Computation Unit memory module
data transfer operations pending. This queue
would need to be accessible to both the processor
in charge of the Memory Storage System bus system
and the processor in charge of scheduling such
transfers.

To reduce contention for the shared memory,
each processor uses a local ROM and RAM for
storage of code and local data. In addition, the
shared memory may be interleaved [14] to further
reduce contention. The degree of interleaving
desirable may be determined by simulation studies
or queuing theory analysis [6] of the Memory
Management System.
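
The effect of interleaving on contention can be
illustrated with a short Python sketch. It assumes
low-order interleaving (bank = address mod number of
banks), which is one common form; the paper does not
say which form would be used, and the function names
are hypothetical.

```python
from collections import Counter


def bank_of(addr, banks):
    """Low-order interleaving: consecutive addresses fall in different banks."""
    return addr % banks


def conflicts(addresses, banks):
    """Requests that must wait if all addresses are issued in the same cycle.

    Each bank serves one request per cycle; any further request to the
    same bank counts as one conflict.
    """
    per_bank = Counter(bank_of(a, banks) for a in addresses)
    return sum(n - 1 for n in per_bank.values())
```

Four processors reading addresses 0 to 3 collide
three times on a single bank and not at all on four
banks, but requests with a stride equal to the number
of banks still collide, which is why the degree of
interleaving is worth studying by simulation.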

The processors within the Memory Management
System may be implemented using commercially
available fixed instruction set microprocessors.
The new generation of 16-bit processors [15, 18,
19, 33] is particularly attractive since many
provide special hardware for operations such as
locked increment and test, memory protection and
management, and problem/supervisor state
switching. Features such as these would considerably
simplify the hardware and software design of the
Memory Management System. An alternative to the
16-bit processors is the less expensive 8-bit
microprocessors currently available [8, 37]. The
choice of a processor type will be governed by
the amount of processing required to perform the
tasks associated with the Memory Management
System and the cost trade-offs involved.

The division of tasks chosen is based on the
main functions which the Memory Management System
must perform. The functions to be performed
include:

1) communication with the System Control Unit and
generating slave tasks based on Parallel Computation
Unit memory module load/unload requests from
the System Control Unit,

2) interrupt handling and generating slave tasks
for data loading sequence changes requested by
the Parallel Computation Unit processors
physically numbered 0 to Q-1 (see previous section),

3) scheduling of Memory Storage System data
transfers,

4) control of input/output operations involving
peripheral devices and the Memory Storage System,

5) control and maintenance of the Memory
Management System file directory information and the
creation and deletion of data files, and

6) control of the Memory Storage System bus
system.

Most Memory Management System operations will
be initiated by the System Control Unit since it
will be responsible for coordinating the
operation of the PASM system. For this reason, the
master processor is chosen to communicate with
the System Control Unit and to perform the task
spawning operations associated with System
Control Unit requests.


Parallel Computation Unit processor inter- 
rupt handling is assigned to one slave processor. 
This slave sends requests for Parallel Computa- 
tion Unit memory module data loading sequence 
changes to the master processor. 

Scheduling of all Memory Management System 
operations involving data transfers using the 
Memory Storage System bus system is assigned to 
another slave processor. One slave processor is 
devoted solely to performing scheduling opera- 
tions since the scheduling of data transfers will 
be complex and time consuming if near optimal
operation of this system is to be realized. 

Another slave is devoted to handling 
input/output between the Memory Storage System 
and peripheral devices such as magnetic tape un- 
its and color video displays. This slave would 
handle any communications with the peripheral 
devices and schedule access to the Memory Storage 
units. 

The control and maintenance of the Memory
Management System file system is assigned to one
or more slave processors. To understand why
multiple slave processors may be required, consider
the configuration of the Memory Storage System.
It will consist of N/Q secondary storage devices
which operate in a parallel fashion. The
secondary storage devices will be required to locate
and transfer data files based on file IDs
presented to the Memory Management System. For
the suggested values of N=1024 and Q=16, a total
of 64 secondary storage devices may be involved
in transferring data files at any given time. It
is apparent that the file location operations
associated with this many devices will exceed the
processing capabilities of one slave processor.
The exact number of slave processors to be
devoted to file directory maintenance will be
determined by simulation and/or queuing theory
analyses of the Memory Storage System.


Another possibility is to assign a microprocessor 
to each Memory Storage System unit for file 
directory maintenance (e.g. intelligent disks), 
and have a single slave coordinate this activity. 

A slave processor is devoted to performing the
operations associated with the configuration and
control of the Memory Storage System bus system.
This would involve setting the control signals
needed to connect each Memory Storage System unit
to the appropriate Parallel Computation Unit
memory module.

The hardware structure of the Memory Manage- 
ment System is such that additional slave proces- 
sors may be added to perform tasks that are not 
considered to be part of the Memory Management 
System processing load at this time. In an actu- 
al prototype Memory Management System, interfaces 
for additional slave processors would be provided 
to facilitate system expansion and the incorpora- 
tion of new features into the Memory Management 
System. 


VII. Conclusions 

An overview of PASM, a partitionable SIMD/MIMD 
system for image processing and pattern recogni- 
tion being designed at Purdue University, was 
given. To improve the throughput of this large 
scale dynamically reconfigurable multimicropro- 
cessor system, a highly parallel memory system 
was described. The memory system uses double- 
buffered primary memories, parallel secondary 
memories, and a set of dedicated microprocessors. 
The organization of this memory system was 
presented and its advantages were discussed. 


ABSTRACT - The paper describes an adaptable
pipeline system with dynamic architecture. The system
adapts its pipelines to the executed programs in
three ways: 1) adaptation of the number of
pipeline stages, whereby each instruction activates the
number of stages in the pipeline which matches the
number of operations it realizes; 2) adaptation to
operation sequences, whereby the pipeline may execute
any sequence of operations without reconfiguration and
thus without the time overhead caused by
reconfiguration; 3) adaptation of the operation time in
each stage, whereby each stage of the pipeline may
execute an operation during the minimally required
time, because it may change the time of operation.
The paper shows that such a pipeline may be organized
from DC groups and is thus amenable to LSI
implementation.


INTRODUCTION 


Pipelining is an attractive choice for faster
computation. However, since pipeline computations are
dedicated, systems in which pipelines are used are
confronted with problems of time overheads caused by
disparity between the pipeline(s) and the executing
algorithms. To broaden their cost-effective
applications, pipeline systems employ various
reconfiguration techniques. The general idea of such
reconfiguration is to reconfigure the system via
software in order to reduce the dissimilarity between
the system and the program.

Consider how existing systems use
reconfiguration. All pipeline systems are divided into two
categories: unifunctional and multifunctional. A
unifunctional pipeline executes a single dedicated
sequence of operations [1-3]. A multifunctional
pipeline system may execute several sequences of
operations either in parallel or sequentially [3-8]. For
multifunctional pipeline systems, each allowable
sequence of operations is executed in the pipeline
configured into a matching sequence of operational
units. This means that to perform a transition from one
sequence of operations to another, the system has to
reconfigure. Since during this reconfiguration no
processing is performed, the time of reconfiguration is
a pure time overhead which has to be minimized. To
organize pipelined computations, the tasks requiring the
same configuration are grouped together. On
completing their execution in one configuration the system
reconfigures and starts execution of a new block of
tasks in the next configuration.

It then follows that one of the major drawbacks of
existing reconfigurable pipeline systems is their
inability to make an instantaneous transition from one
sequence of operations to another. Furthermore, since
many programs usually have different sequences of
operations following each other, pipeline type of
computations may be used cost-effectively only in a
limited sense. Thus limited applicability is the most
severe drawback of existing pipeline systems. Their
other drawbacks are associated with the following
causes:

(1) Frequent disparity between the number of
consecutive operations in the instruction and the number
of pipeline stages in the pipeline which processes this
instruction. This creates additional delays (dummy time
intervals) associated either with instruction propagation
through unneeded stages or with conflict resolution when
the instruction bypasses the unneeded stages and
encounters operands prepared for some of its
predecessors [9].

(2) In existing pipelines the time for non-iterative
processor dependent operations (addition, subtraction, >,
etc.) is permanent and does not depend on operand sizes.
However, selection of a permanent operation time in
each stage requires that it be selected as the time of the
longest operation (addition handling maximal word sizes).
It then follows that all faster operations (processor
dependent operations handling smaller word sizes or
Boolean operations, etc.) are slowed down because they
are executed during the time of the longest processor
dependent operation.

(3) A significant additional waiting time occurs in
existing pipelines when the computational result obtained
for one instruction in one stage is required as an operand
for another instruction in another stage. For instance, if
the instruction executes A*B+(C^2-D^2), the temporary
result A*B cannot be sent to the memory of some
pipeline stage which might need it in future
computations. Instead, all temporary results needed by a
pipeline stage have to be grouped together and then sent
to its memory. Or, in order not to waste time on
grouping and transferring blocks of information, the
needed temporary results are repeatedly computed.


MOTIVATION 


As follows from the shortcomings of existing
reconfigurable pipeline systems, a further performance
improvement can be accomplished if a pipeline system is
provided with the following architectural solutions:

(a) the system minimizes the time overhead caused
by reconfiguration from one pipeline to another,

(b) it equips each pipeline with a variable number of
stages. This means that each instruction must activate
the number of stages in a pipeline which matches the
number of consecutive operations it implements, i.e., an
instruction implementing w operations has to pass
through w stages and complete its execution.

(c) the system provides each pipeline stage with
a variable time interval for operation. Then each stage
will be capable of generating a minimal operation time
specified either by the sizes of operands for
processor-dependent operations or by a small permanent
duration for processor-independent operations.
Consequently the pipeline equipped with a variable time
interval in each stage will be capable of working at a
variable rate and will deliver results much faster if it is
filled with short operations.

(d) the system provides for fast exchanges of
temporary results between pipeline stages. This will
allow on-line feeding of temporary results obtained in
one stage to other pipeline stages that will need them in
the future.
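
The benefit claimed for solution (c) can be
illustrated with a short Python sketch for a single
stage. The operation times are hypothetical units
chosen for illustration, and the model ignores all
inter-stage effects; it only contrasts a permanent
interval equal to the longest operation with an
interval that follows each operation.

```python
def stage_time_fixed(op_times):
    """Total time when every operation takes the longest operation's time."""
    return len(op_times) * max(op_times)


def stage_time_variable(op_times):
    """Total time when each operation takes only its own minimal time."""
    return sum(op_times)


# Hypothetical times: one long addition and three short operations.
times = [4, 1, 2, 1]
fixed = stage_time_fixed(times)
variable = stage_time_variable(times)
```

The gap between the two totals grows with the share
of short operations in the instruction stream, which
is the case solution (c) is aimed at.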

The architectural solutions which provide pipeline
systems with these desirable characteristics can be
obtained if a pipeline system is assembled from Dynamic
Computer groups (DC-groups) discussed in [10, 11]. Such
a system is called a dynamic pipeline architecture.
General concepts of a dynamic pipeline assembled from
DC-groups were introduced in [12].

This paper is dedicated to their further
development. It introduces new research results on the
following subjects:

(1) organization of a pipeline stage capable of
generating minimal time intervals for all its operations;

(2) organization of effective information
exchanges between pipeline stages;

(3) addressing procedures which allow on-line
sending of temporary results to the destination pipeline
stages.


DYNAMIC PIPELINE: GENERAL CONCEPTS 


The hardware resource for a dynamic pipeline may be
organized as follows [12]. It includes a single computer
supervisor, C_0, and several k·h-bit computers, C_1, C_2,
..., C_F, forming consecutive pipeline stages (Fig. 1).
Computer C_0 stores instructions in memory M_0 and
fetches them to processor P_0. The C_0 computer's size
matches that of one instruction. Each pipeline stage C_i
has memory M_i for storing initial data and addresses,
processor P_i, and general register set m_i, which stores
temporary results required by the P_i processor. These
are either computed by P_i or by other processors.

Each connecting element MSE_i separates two
pipeline stages C_i and C_(i+1) and may assume two modes
of transfer: right and no transfer. For the right
transfer, MSE_i transfers the pipelined instruction to the
next right stage C_(i+1) with a delay of one interval. For
the no transfer, the instruction received by MSE_i is not
transferred to the next stage C_(i+1). The MSE element is
implemented as a universal module UM equipped with the
modular control organization [10, 11].

FIGURE 1
Hardware diagram of a dynamic pipeline

Address elements ASE_1, ..., ASE_F are connected
into a shifting sequence and transfer the address portion
of the program instruction (AI instruction). Each ASE is
an 8-bit microprocessor, UM, equipped with two levels of
memory ME-1 and ME-2. These memories store constants
which may modify codes and addresses stored in AI
instructions. Each time a pipeline stage C_i begins
execution of the operation assigned to it by the program
instruction, the address element ASE_i, with the same
position i, receives the AI instruction shifted from
ASE_{i-1}. It then follows that ASE_i is synchronized by
stage C_i and stores the AI instruction during the time
C_i executes its operation. If the result of this
operation is a future temporary result for another stage
C_j, it is written to its general register set m_j. To do
this, ASE_i sends the address for the m_j set to stage
C_i, which thereafter broadcasts the computational result
to the destination. Thus the number of MSE elements is
F-1 and the number of ASE elements is F, where F is the
number of pipeline stages.


FIGURE 2
Formats of PI and AI instructions


Each instruction fetched from M_0 to P_0 includes
two portions: the pipeline portion, PI, and the address
portion, AI (Fig. 2). By passing through the bus made of
MSE connecting elements, the pipeline portion, PI,
propagates through consecutive pipeline stages with a
delay of one interval, causing execution of an operation
assigned to each stage. Concurrently, the address
portion, AI, of the instruction propagates through the
bus made of ASE elements and specifies a pipeline stage,
C_i, which should output the result, and a general
register set, m_j, which should receive this result.


FORMATS OF THE PI AND AI INSTRUCTIONS 


In order to adapt to the operation, it is necessary
that each PI store its own opcode D. However, when the
same D propagates through the pipeline stages, it will
activate the same operation in each stage. In order that
each stage, C_i, execute an individual operation, C_i
should store a position code, d_i, which is the binary
value of its position within the pipeline. For instance,
for the C_1 stage d_1 = 001, for the C_2 stage d_2 = 010,
etc. These two codes, D and d_i, achieve the selective
activation of an operation assigned to stage C_i by
instruction PI.

Adaptation to the length of the pipeline is
performed with another code, w, which shows the
number of consecutive pipeline stages which execute
instruction PI. Clearly w matches the number of
consecutive operations which are realized in PI. As the
instruction propagates through each connecting element,
MSE_i, containing position code d_i, w is compared with
d_i. If d_i < w, MSE_i propagates this PI instruction to
the next stage with a delay of one time interval. If
d_i ≥ w, MSE_i blocks further instruction execution.
Therefore, by writing a code w into the PI instruction
field, one may select the number of pipeline stages
required by a single instruction.

This technique allows one to perform a simple
adaptation of the number of pipeline stages that
requires no bypassing of unneeded stages and no
conflict resolution associated with such bypassing.
The entire problem is solved by the code w, whose size
is log_2 F bits, where F is the maximal length of the
dynamic pipeline.
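
The stage-selection rule above can be illustrated with a minimal Python sketch. It assumes stages are numbered 1..F and that the position code of stage C_d is simply d; the function names `stages_executing` and `w_field_bits` are hypothetical and do not come from the paper.

```python
def stages_executing(w: int, F: int) -> list[int]:
    """Positions of the stages that execute a PI instruction whose
    length code is w, in a pipeline of F stages (assumed model)."""
    if not 1 <= w <= F:
        raise ValueError("w must lie between 1 and F")
    executed = []
    for d in range(1, F + 1):   # d is the position code of stage C_d
        executed.append(d)      # stage C_d executes its operation
        if d >= w:              # MSE_d blocks further propagation
            break
    return executed


def w_field_bits(F: int) -> int:
    """Size of the w field, following the log_2 F estimate in the
    text, rounded up to a whole number of bits."""
    return (F - 1).bit_length()
```

Under these assumptions, an instruction with w = 3 occupies only the first three stages of a ten-stage pipeline, and no bypass logic is involved.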

Adaptation of each stage to operation time is
performed with another code, k, stored in the PI
instruction. Since each LSI module of processor P_i is
equipped with the same modular control organization
[10, 11, 13], this allows one to organize variable time
intervals for any processor-dependent operation.
Therefore three codes, D, w, and k, effect a pipeline's
adaptation to the operation, to the length of the
pipeline, and to the time of the operation executed by
each pipeline stage.

In addition to adaptation codes, each PI
instruction stores the relative address, A_p, of the M_i
memories, where each memory M_i stores a word
required by the C_i stage or the address of the m_i
register set. Using this address, a second operand is
fetched to P_i from m_i. A special tag bit in the cell
accessed by the A_p address recognizes whether the second
operand is stored in M_i or m_i. In order not to have any
limitations on the height of memory M_i and not to
increase the bit size of PI, it is assumed that A_p is a
relative 8-bit address. The effective address R (24
bits) is formed from a concatenation of the base
address B (16 bits) and address A_p (8 bits). Base
address B is fed continuously to all M_i memories during
the execution of one task. It is changed when data
words have to be fetched from a new page. Such
organization provides that each M_i memory may
contain 2^24 words. Finally, the PI instruction stores
the S code, which shows the registers of the P_i
processor to be connected with the adder inputs and
the destination of the computational result if other
than the next stage.
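
The formation of the effective address can be sketched in a few lines of Python. The field widths (16-bit base, 8-bit relative address) are taken from the text; the function name `effective_address` and the choice to place B in the high-order bits are assumptions made for illustration.

```python
BASE_BITS = 16   # width of base address B
REL_BITS = 8     # width of relative address A_p


def effective_address(base: int, rel: int) -> int:
    """Concatenate base address B and relative address A_p into the
    24-bit effective address R (B assumed to be the high-order part)."""
    if not 0 <= base < (1 << BASE_BITS):
        raise ValueError("base address must fit in 16 bits")
    if not 0 <= rel < (1 << REL_BITS):
        raise ValueError("relative address must fit in 8 bits")
    return (base << REL_BITS) | rel
```

Because B changes only when a new page is needed, each PI instruction carries just the 8-bit part, while the memory can still span 2^24 words.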

The AI instruction stores the position code d of a
pipeline stage which should receive a temporary result
produced by some other stage and the address a_m of the
m register set where the result should be written. In
each ASE element, the d and a_m values may be modified
by addition with constants stored in the ME memory
assigned to that ASE. This allows a programmer to send
every temporary result computed by a PI instruction, in
each pipeline stage it propagates through, to any m
register set. Thus every pipeline stage may be provided
with all temporary results it may need in the future.


ORGANIZATION OF A PIPELINE STAGE 


Consider now the organization of a single pipeline
stage. Each stage is assembled from one DC-group
having n computer elements, CE [10, 11]. Each CE
processes h-bit words. Then a pipeline stage C_i may
have its processor P_i assuming the following word sizes:
h, 2h, ..., n·h. If h = 16 and n = 4, P_i ranges from 16
to 64 bits in 16-bit increments, i.e., it assumes 16, 32,
48, and 64 bit sizes.

A pipeline stage C_i is equipped with the general
register set m_i storing temporary results and
partitioned into two levels m_i(1) and m_i(2) (Fig. 3).
The reason for this is that the same m_i register set
may be accessed twice during one time interval: when
an operand is fetched and when the temporary result is
written. To perform these accesses in parallel takes
two levels of memory, so that if one access is performed
over one level, then the second one is performed over
the other level.


FIGURE 3
Hardware diagram of one stage

Since each pipeline stage C_i may send its
computational result to any m register set, the main
memory is provided with F + 1 logical circuits H_1, ...,
H_{F+1}, where F is the number of stages (in Fig. 3,
F = 10). Each H_j logic (j = 1, ..., F), i.e., every one
but the last, H_{F+1}, broadcasts a k·h-bit data word in
two directions: to or from the respective m_j register
set. The last circuit H_{F+1} connects this stage with
the main memory, for the case when the instruction ends
its execution and its computational result has to be sent
to the main memory. Selective activation of each H_j
circuit connecting the P_i processor with the m_j
register set is performed with position code d_j.

Since the m_j register set is partitioned into two
levels m_j(1) and m_j(2), communication with m_j(1) is
made through one group of Z pins of the P_i processor;
likewise, communication with m_j(2) is accomplished
through a separate group of Z pins. Selection of the m_j
register set which has to store the computational result
obtained in the P_i processor is performed with the
position code d stored in the AI instruction.

As was shown above, the AI instruction is available
in the ASE_i address element during the time interval in
which the pipeline stage C_i executes the respective PI
portion of the instruction. If the C_i stage has to send
a temporary result to the m_j register set of another
stage C_j, the position code d_j generated in ASE_i
activates broadcast of both the temporary result through
the H_j circuit and the destination address A_m through
the E_j logic. To this end, each ASE element is provided
with F logical circuits E_1, ..., E_F, each E_j
broadcasting the destination address A_m of the m_j
general register set (j = 1, ..., F). If the result
obtained in the P_i processor has to be written to the
local m_i register set belonging to the same stage, the
AI instruction stores the d_i code (d_i = i) so that it
selects the E_i and H_i circuits for broadcasting both
address and computational result.

The height of the m register set is relatively
small, since it stores temporary results. Therefore,
most of the time, the final result of each PI
instruction must be written to the main memory of the
pipeline system. Since a PI instruction implements a
sequence of w operations, it ends its execution in stage
C_w. Since w varies over a wide range, any pipeline stage
can be the end stage of some instruction. Therefore
any pipeline stage must be capable of sending its
computational result to the main memory.

This is organized as follows. If the PI instruction
obtains the final result in stage C_w, this result is sent
to the next stage C_{w+1}. The reason for this is that
the C_{w+1} stage has an empty cell in address B + A_p,
where A_p is the relative address stored in the PI
instruction. It then follows that one may store in the
B + A_p address of data memory M_{w+1} the destination
address of the main memory where the final result can
be written. This address is fetched from the M_{w+1}
memory to the R4 register of the P_{w+1} processor
(Fig. 4). Since each stage sends its result to the R3
register of the next stage, the P_{w+1} processor, having
the final result and its main memory address in its R3
and R4 registers, will broadcast them to the main
memory through logical circuit H_{F+1}. This circuit
connects each stage with the main memory. (In Fig. 3,
H_{F+1} = H_11, since F = 10.) It is activated by the
signal w produced in the last connecting element MSE_w
which propagates the instruction. This connecting element
is recognized by the equality d_i = w between its
position code d_i and the code w stored in the PI
instruction, which shows how many stages the PI
propagates through.


FIGURE 4
A simplified diagram of the P_i processor


COMPUTATIONS IN A PIPELINE STAGE 


At each time interval, pipeline stage C_i performs
two phases of computations concurrently:

(a) basic phase, aimed at execution of the
operation provided by the current PI instruction;

(b) preparatory phase, aimed at preparing data
words for the next PI instruction in the same pipeline
stage.

Let us consider the actions executed in each
pipeline stage by each of the phases.


Basic Phase


This phase executes the operation assigned by the PI
instruction to the C_i stage. The PI instruction is in the
R5 register of the P_i processor. It was transferred there
at the previous time interval (Fig. 4). The operation is
performed over two operands stored in registers of the
P_i processor. The result of the operation may be sent
to any combination of the following destinations:

(1) P processor of the next stage: By passing
through the Z_4 pins of the P_i processor, the result is
written to the R3 register of the next P_{i+1} processor.
This occurs when the P_{i+1} processor completes its
basic phase.

(2) m register set of any pipeline stage: This
transfer is performed through the pair of circuits H, E
described above and activated by the d position code
stored in the AI instruction.

(3) main memory of the pipeline: The transfer is
made through the H_{F+1} circuit.


Preparatory Phase 


In each pipeline stage, the preparatory phase is
executed concurrently with the basic phase. Its
objective is to prepare operands for the next PI
instruction. Let us establish the actions executed in a
pipeline stage C_i during the preparatory phase.

(1) Writing the next PI instruction: In the
beginning of the preparatory phase, the preceding
MSE_{i-1} connecting element sends the next PI
instruction to the P_i processor and to the MSE_i
connecting element of stage C_i. In the P_i processor it
is received through Z pins to the R5 register (Fig. 4).

(2) Writing the result of the operation executed
in the preceding stage: The P_i processor receives this
operand to its R3 register.

(3) Fetch of the second operand from a data
memory M or m attached to the stage: Each stage may
fetch the second operand from either the M or the m
memory. To this end it forms the effective address
B + A_p, where A_p is brought by the PI instruction. The M
memory is accessed by this address directly, i.e., the
operand is stored in the B + A_p cell. When fetched, it is
written to the R4 register in the processor. The m
register set is accessed indirectly because the B + A_p
address of the M memory stores the address of the m
register set. Being fetched from the M memory, this
address is sent to the m memory. The operand fetched
from there is written either to the R1 register if it was
stored in the m(2) level or to the R2 register if it was
stored in the m(1) level of the m register set.

Since the actions (1) to (3) prepare all the
necessary information for the next basic phase, upon
completion of these actions the preparatory phase ends
and is followed by the basic phase.
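
Action (3) can be illustrated with a short Python sketch. The data layout is an assumption made for illustration only: each M cell is modelled as a dictionary with a tag bit `in_m` and a `word`, and an m address is modelled as a `(level, index)` pair; the paper does not specify these encodings, and the name `fetch_second_operand` is hypothetical.

```python
REL_BITS = 8  # width of the relative address A_p


def fetch_second_operand(M, m_levels, base, rel):
    """Return (register, operand) for the second operand.

    M        : dict mapping effective address -> {"in_m": bool, "word": ...}
    m_levels : dict mapping level (1 or 2) -> list of register contents
    """
    cell = M[(base << REL_BITS) | rel]   # effective address B + A_p
    if not cell["in_m"]:
        # Direct access: the cell of M holds the operand itself.
        return "R4", cell["word"]
    # Indirect access: the cell of M holds the address of an m register.
    level, index = cell["word"]
    operand = m_levels[level][index]
    # Level m(1) feeds R2 and level m(2) feeds R1, as stated in the text.
    return ("R2" if level == 1 else "R1"), operand
```

The tag bit decides between one memory access (direct) and two (indirect), which is why the text later requires both access times to be shorter than the fastest operation.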


SYNCHRONIZATION OF BASIC 
AND PREPARATORY PHASES 


Since each pipeline stage is equipped with the 
modular control organization, it may generate minimal 
operation times for all operations it executes. This 
allows one to obtain a pipeline working with variable 
rate. If the pipeline is filled with short operations 
executed during minimal times, it produces the results 
faster with the rate of a short operation. Therefore 
implementation of a variable operation time in a 
pipeline stage is the source of additional speed-up, in 
comparison with pipelines working with permanent 
operation time in a stage. 

However, introduction of a variable operation
time in a stage requires synchronization of the basic and
preparatory phases in different pipeline stages.
Indeed, any next PI instruction and its two operands
can be prepared for execution in the P processor of a
pipeline stage only at the moment when the
current PI instruction has ended its execution and sent
the result to the next stage. It then follows that for any
stage its preparatory phase cannot be shorter than its
basic phase.

Furthermore, since the preparatory phase in a stage
prepares two operands for execution, it requires
availability of these operands. Of these two, the
operand to be fetched from the M or m memories can be
made available provided the times of directly accessing
the M memory or indirectly accessing the m register set
are smaller than the time of the fastest operation
executed in a stage (16-bit addition). As for the
second operand, to be received from the preceding
stage, C_{i-1}, it may be sent to the stage C_i only when
C_{i-1} completes its operation. It then follows that if
C_{i-1} executes a shorter operation than C_i, the result
from C_{i-1} appears earlier than C_i completes its
operation, i.e., its basic phase. Since the operands
may be written to C_i only when it completes its basic
phase, the preparatory phase in stage C_i coincides with
its basic phase.

Therefore, for two pipeline stages C_{i-1} and C_i
executing a shorter and a longer operation respectively,
the time of the preparatory phase in C_i coincides with
the time of the basic phase in the same C_i.

If, on the other hand, the C_{i-1} stage executes a
longer operation than C_i, then the operand from stage
C_{i-1} appears after C_i completes its operation. Then
the preparatory phase in C_i is determined by the basic
phase in C_{i-1}. Thus it will last longer than the basic
phase in C_i. Therefore, for two pipeline stages C_{i-1}
and C_i executing longer and shorter operations
respectively, the time of the preparatory phase in C_i
matches that of the basic phase in the preceding stage
C_{i-1}.


FIGURE 5
Sequencers in a pipeline stage


Thus we have shown that for the C_i stage, its
preparatory phase either coincides with the basic phase
of the same stage C_i, or with the basic phase of the
preceding stage C_{i-1}. Therefore for each pipeline stage
its preparatory phase introduces no additional delay in
the rate of pipeline operation, which is determined only
by consecutive durations of its basic phases.
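
The two cases above reduce to taking the longer of two basic phases. A minimal Python sketch follows; it assumes durations are given in clock periods for the operations running in the same interval, and that the first stage has no predecessor constraint. The function name `preparatory_times` is hypothetical.

```python
def preparatory_times(basic: list[int]) -> list[int]:
    """Duration of the preparatory phase in each stage, given the
    durations of the basic phases running in stages C_1..C_F."""
    prep = []
    for i, own in enumerate(basic):
        # Assumption: stage C_1 is not constrained by a preceding stage.
        preceding = basic[i - 1] if i > 0 else 0
        # The phase lasts as long as the longer of the stage's own basic
        # phase and the basic phase of the preceding stage.
        prep.append(max(own, preceding))
    return prep
```

For example, with basic phases of 1, 4 and 2 periods, the third stage must wait for the operand from the second, so its preparatory phase lasts 4 periods.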

Such synchronization of phases may be
accomplished as follows. As was shown in [12], the
modular control organization provides that the duration
of each PI instruction be determined by the two
interactive units CAD-I and CAD-M, where CAD-I
functions as either a decoder or a sequencer and CAD-M
is a sequencer which specifies the time of operation. If
a PI instruction activates a non-iterative operation in a
stage (addition, subtraction, Boolean), then in the
CAD-I of this stage the decoder CAD-ID is activated
(Fig. 5). For an iterative operation (multiplication,
division) executed in a stage, its CAD-I works as a
sequencer CAD-IS. The CAD-I functioning is controlled
with the following codes: the opcode D it receives with
the PI instruction and the position signal i produced
locally by the position code d_i.

Consider now how one may organize a variable
time T of the operation executed in the C_i stage of the
pipeline. T is variable, i.e., T = t_0·b, where t_0 is the
time of h-bit addition in one LSI module and b depends on
the k and D codes stored in the PI instruction. For a
processor-dependent operation (addition, subtraction),
which has to last T = k·t_0, the output of the CAD-I
decoder initiates the CAD-M sequencer, which executes a
loop having k states. During this time the CAD-I
maintains a microcommand MIC which activates the
operation in the processor. When CAD-M completes its
loop, this terminates the CAD-I output. If the operation
is independent of the processor size (Boolean, shift,
etc.), then it is activated by the CAD-I decoder only.
Namely, CAD-M is not initiated and the operation takes
the time t_0 of one small clock period. If the C_i stage
executes an iterative operation, CAD-I executes a
sequence containing several states. If a state in this
sequence has to last the time k·t_0, CAD-I initiates the
CAD-M and performs the transition to the next state
under the completion signal issued by the CAD-M.
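
The three timing cases can be summarized in a small Python sketch. It is only a model under stated assumptions: the unit time is written `t0`, the class `Kind` and the function `operation_time` are hypothetical names, and an iterative operation is described by a sequence of flags saying which CAD-I states call the CAD-M loop.

```python
from enum import Enum


class Kind(Enum):
    PROCESSOR_DEPENDENT = 1    # e.g. addition, subtraction
    PROCESSOR_INDEPENDENT = 2  # e.g. Boolean, shift
    ITERATIVE = 3              # e.g. multiplication, division


def operation_time(kind, k, t0=1, long_states=()):
    """Operation time T in a stage whose processor spans k modules.

    long_states applies to iterative operations only: one flag per
    CAD-I state, True if that state lasts k*t0 (CAD-M loop of k states),
    False if it lasts a single period t0 (assumed).
    """
    if kind is Kind.PROCESSOR_INDEPENDENT:
        return t0                      # CAD-M is not initiated
    if kind is Kind.PROCESSOR_DEPENDENT:
        return k * t0                  # CAD-M runs a loop of k states
    return sum(k * t0 if is_long else t0 for is_long in long_states)
```

With k = 4 and a unit time of 2, an addition takes 8 time units while a shift still takes 2, which is the source of the variable pipeline rate discussed above.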

As follows from this organization, in each pipeline
stage the completion of the basic phase occurs when
CAD-M performs the transition to its initial state and
issues the completion signal, CS. It then follows that if
C_{i-1} has to delay its basic phase because the next
stage C_i executes a longer operation, then this delay
can be accomplished if CAD-M in C_{i-1} delays its
transition to the initial state until CAD-M of the next
stage C_i establishes the state which immediately
precedes the initial one. At the next clock period, both
CAD-Ms perform concurrent transitions to their initial
states, which will mean a concurrent end of the basic
phases in C_{i-1} and C_i.

Such synchronization can be accomplished if any
transition of CAD-M in C_{i-1} to the initial state is
activated by either a completion signal CS_i generated
by the CAD-M in C_i or its immediate predecessor PR_i
produced by the state which precedes the initial one,
i.e., CS_i v PR_i. This means that in the pipeline each
right processor P_i has to be connected with its next
left processor P_{i-1} via a one-line connection, sending
the signal CS_i v PR_i produced by CAD-M in P_i. This
signal activates the transition of each left CAD-M to the
initial state. This will accomplish synchronization of
shorter and longer basic phases executed in C_{i-1} and
C_i respectively.

If C_{i-1} and C_i execute respectively longer and
shorter basic phases, then CS_i v PR_i generated in CAD-M
of C_i cannot be used for synchronization, since CAD-M in
C_i finished its operation much earlier than that in
C_{i-1}. For this case the time when C_{i-1} completes
execution and issues an operand for C_i can be
determined by the PR_{i-1} signal generated by CAD-M in
C_{i-1}. Indeed, at the next clock period C_{i-1} will
send the operand to the R3 register of C_i, causing
completion of the preparatory phase in C_i. This can be
accomplished if each left processor P_{i-1} is connected
with its next right neighbor P_i via a one-line
connection sending the signal PR_{i-1} produced by CAD-M
in P_{i-1}. This signal will enable writing of two
operands and the PI instruction to registers of the P_i
processor.

This synchronization of phases requires that every
pair of neighbors P_{i-1} and P_i be connected with two
lines, so that using one line P_i synchronizes P_{i-1}
with signals CS_i v PR_i generated in P_i, and using the
other line P_{i-1} synchronizes P_i with signal PR_{i-1}
generated in P_{i-1}.


OPERATION OF ADDRESS ELEMENT ASE


Each ASE_i element generates the address A_m and
position code d of the m register set which has to
receive the temporary result produced in stage C_i. Since
for every PI instruction each stage C_i may send its
result to any m register set, each ASE_i element
receiving the AI instruction has to be provided with an
opportunity to change the values of a_m and d stored in
the AI instruction. Then each time the AI instruction is
stored in ASE_i it may have new values of A_m and d,
requiring no additional bits in AI to store them. This
will allow AI to have the minimal size (required to
store one address a_m and one position code d only) and
at the same time to maintain this powerful option of
communication between pipeline stages.


FIGURE 6
Address elements ASE

Let us consider one organization which changes the d
code and a_m address for each pipeline stage. It
provides that each address element ASE contain one
microprocessor UM and two levels of memory ME-1 and
ME-2 (Fig. 6). The fast and small memory ME-1 is
connected with UM; the slower and larger memory ME-2
is connected with ME-1. Thus ME-1 may send its
information to UM; likewise ME-2 may send its
information to ME-1. ME-1 stores constants that may
be added to values stored in the AI instruction. By
writing different constants k_{i-1}, k_i, k_{i+1}, etc.
to the same address a_m of consecutive memories
ME-1_{i-1}, ME-1_i, ME-1_{i+1} respectively, one may
achieve variation in the d code and a_m address when the
AI instruction propagates through consecutive ASE
elements. Thus one cell in each ME-1 stores information
to modify one AI instruction for the respective ASE.
Since the number of cells in ME-1 is small (inasmuch as
ME-1 has to be very fast), to increase the number of
consecutive AI instructions which can be modified via a
single ME-1, each ME-1 is connected with a larger memory
ME-2 designated to replenish the content of each cell in
ME-1 whenever it is accessed by the AI instruction. Thus
a modified content of this cell may now be accessed by a
new AI instruction, causing no restriction on the number
of AI instructions which may modify their d and a_m
values.

For example, consider modification of the d code and
a_m address stored in the AI instruction when it
propagates through ASE_1, ASE_2, ASE_3 (Fig. 6). When the
AI instruction is received in ASE_1, the address a_m it
stores is sent to both memories ME-1 and ME-2. The ME-1
fetches constant k_1, which is used to modify a_m and d:
a_m + k_1 -> A_m, d + k_1 -> d. The new values A_m and d
are issued by ASE_1. At the next interval, the same
address a_m accesses ME-2, which sends a new constant
to the a_m cell of ME-1. Thus the cell of ME-1
accessed by the AI instruction in ASE_1 is updated.
When the C_1 stage completes its basic phase, producing
the PR_1 signal, the AI instruction is transferred with
this signal to the next ASE_2 element, where it is added
with the new constant k_2 stored in the same address
a_m, etc.


Therefore, by passing through F consecutive ASE
elements the same instruction may generate F different
values of A_m and d. This allows a programmer, for
each pipeline instruction containing w stages, to form w
addresses and position codes, so that each stage
activated by the PI instruction may send its result not
only to the next stage but also to any m register set.
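
The effect of the ME-1 constants can be illustrated with a minimal Python sketch. Several points are assumptions rather than facts from the paper: each constant is modelled as a pair (one increment for the address, one for the position code), the AI instruction itself is passed on unmodified so that every ASE indexes its ME-1 with the same stored address, and the ME-2 replenishment step is omitted. The name `propagate_ai` is hypothetical.

```python
def propagate_ai(a, d, me1_tables):
    """Values (A, d) issued by consecutive ASE elements for one AI
    instruction carrying address a and position code d.

    me1_tables[i] models the ME-1 memory of the (i+1)-th ASE as a dict
    mapping the stored address a to a pair of constants (ka, kd).
    """
    issued = []
    for me1 in me1_tables:
        ka, kd = me1.get(a, (0, 0))   # no entry: values pass unchanged
        issued.append((a + ka, d + kd))
    return issued
```

In this model a single AI instruction with one address field and one position code yields a different destination in every ASE it passes, which mirrors the claim that w addresses and position codes can be formed without enlarging AI.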


Abstract


In a multiprocessor system with limited 
communications paths between processors, a heavy 
communications demand is made by an algorithm 
requiring input values from all processors in the 
system. The Finite Element Machine is a special
purpose computer designed from 1024 microprocessors
communicating by way of explicit word-serial
channels. The algorithms for structural analysis
using the finite element method which have been 
proposed for this machine require both maximum 
and summation over a set of numbers stored one 
per processor. Three distinct types of sum and 
maximum algorithm making different use of the 
communications paths have been formulated and 
analyzed with respect to execution time. The 
results give some insight into the best way to 
use communications paths in a multiprocessor. 

The study demonstrated the need for a special 
hardware mechanism to support the global sum and 
maximum operations in this machine. The design 
of this hardware unit is also discussed. 


Introduction 


The Finite Element Machine [1,2] is a special 
purpose computer architecture developed to support 
structural analysis using methods based on the 
theory of finite elements [3]. The machine con- 
sists of a large number (1024 is the target de- 
sign) of microprocessors communicating over a 
network of parallel "local" channels connecting 
each processor to a limited number of other pro- 
cessors. The parallel local channels are backed 
up by a time multiplexed global bus connecting all
processors. If a node in a finite element model 
of a structure is considered to be "connected" to 
another node when the two correspond to a nonzero 
stiffness matrix coefficient, then the model forms 
a graph with the qualitative characteristic that 
each node connects to only a few others. The 
Finite Element Machine with processors as nodes 
and local communications links as edges has the 
same qualitative characteristic. The idea is to 
use this qualitative similarity to carry out the 
solution of finite element model equations in 
such a way that most of the required interpro- 
cessor communications can be performed using the 
parallel local channels while the time multi- 
plexed bus takes care of the mismatches between 
the finite element model topology and the topolo- 
gy imposed by the fixed local channels. 

The linear equations of the finite element
model are solved by iterative methods so that
most data interchange between processors is determined
by the topology of nonzero elements of the
sparse stiffness matrix. In any iterative method,
however, a test for convergence must be made which 
depends on values associated with every node. In 
particular, a maximum must be computed over 1024 
values, one stored in each processor. Rapidly 
converging iterative methods, such as the con- 
jugate gradient method [4], require not only maxi- 
mum but also summation over values from all pro- 
cessors. These global computations strain the 
communications capabilities of a machine tailored 
to the more limited requirements of iterative up- 
dating of values. 


This paper takes the architecture of the 
Finite Element Machine as fixed and proposes 
three algorithms for performing the global sum 
and maximum calculations. These algorithms are 
compared with respect to execution time to measure 
their suitability for a 1024 processor machine. 
The fixed machine architecture is then extended 
by the addition of special hardware to make these 
computations more efficient. 


Description of the Multiprocessor 


The Finite Element Machine consists of a 
large number (~1000) of microprocessors communi- 
cating by way of a network of point to point 
First-In-First-Out (FIFO) communications. The 
multiprocessor array is not easily characterized 
as tightly or loosely coupled [5]. The speed and 
information carrying capacity of the communications
network would indicate a tightly coupled
system while the lack of shared memory would sug- 
gest loose coupling. The communications occur 
over two separate networks: the local network 
and the global network. The local network con- 
sists of a large number (~8000) of bidirectional 
unshared connections between two nearest neighbor 
processors. The global network consists of a 
single time multiplexed bus connecting all pro- 
cessors. A third interconnecting network is com- 
posed of a set of signal flags which are primarily 
oriented toward processor synchronization. A 
brief description of the three networks intercon- 
necting the processors is necessary to an under- 
standing of the algorithms to follow. Further 
details can be found in [2]. 


The local communications network imposes a 
fixed interconnection topology on the 16 bit 
microcomputers making up the array. The current 
prototyping effort being carried out at NASA
Langley Research Center will allow investigation
of several interconnection patterns. For the 
purposes of this paper, however, we will consider 
only a square array of 32 x 32 processors with 
each processor having a local communications link 
to each of its eight nearest neighbors. Process- 
ors on the boundaries are connected to those on 
opposite boundaries in the manner of a toroid so 
that all processors have a full eight neighbors. 
The global bus simply connects in parallel to all 
of the processors in the array. Of course this 
is only true logically; electronically, the bus 
and its arbitration network form a tree structure. 
The structure of the signaling flag network paral- 
lels that of the global bus. An abbreviated PMS 
[6] diagram for the processor array is shown in 
Figure 1. 


[Diagram: processor P(#i,#j), with its own S-M, is linked
to its eight neighbors P(#i-1,#j+1), P(#i,#j+1),
P(#i+1,#j+1), P(#i-1,#j), P(#i+1,#j), P(#i-1,#j-1),
P(#i,#j-1) and P(#i+1,#j-1), each with its own S-M; the
array attaches to L(Global bus)-C(Control)-(I/O
environment).]


Figure 1: Partial PMS Diagram for the Array
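
The toroidal eight-neighbor topology described above can be sketched in Python as follows. The row/column orientation (north meaning a decreasing row index) and the names `OFFSETS` and `neighbors` are assumptions made for illustration.

```python
N = 32  # the array is 32 x 32 processors

# Row and column offsets for the eight points of the compass
# (orientation assumed: north = row - 1, east = column + 1).
OFFSETS = {
    "N": (-1, 0), "NE": (-1, 1), "E": (0, 1), "SE": (1, 1),
    "S": (1, 0), "SW": (1, -1), "W": (0, -1), "NW": (-1, -1),
}


def neighbors(i: int, j: int, n: int = N) -> dict:
    """Coordinates of the eight neighbors of processor (i, j), with
    boundary processors wrapped around in the manner of a toroid."""
    return {name: ((i + di) % n, (j + dj) % n)
            for name, (di, dj) in OFFSETS.items()}
```

The modulo arithmetic is what gives boundary processors a full set of eight neighbors.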


Local communications output is broadcast to 
all neighbors simultaneously, although neighbors 
can be selectively disabled from receiving. In- 
formation coming in from a neighbor is placed in 
a hardware FIFO, which the receiving processor 
may interrogate and empty at its own initiative. 
Transmission is actually bit serial to conserve 
hardware but becomes parallel at the processor 
interfaces. Figure 2 shows a logical block dia- 
gram of the local communications interfaces for 
one processor and lists the primitive operations 
the processor can perform in connection with 
these interfaces. The various local neighbors 
are identified by the eight points of the compass. 


The global bus acts as a time multiplexed 
crosspoint switch, allowing the transmission of 
one 16 bit word from any source processor to any 
destination processor. A transaction thus appears 
on the bus as three items: the source processor 
number, destination processor number and data
word. Since a bus transaction time is short compared
to an instruction time, a FIFO buffer receives
input from the bus. Further, since contention for
the bus may delay a transmission, FIFO buffering
is also provided for output to the bus. A
destination register allows a processor
to transmit several successive data words to the
same destination without respecifying it. A spe- 
cial processor number serves to identify broadcast 
data and matches the address of any processor 
which is enabled for broadcast reception. 


[Diagram: one input FIFO per neighbor (From N, From NE, From NW,
...) feeding the processor, and one enable per output link
(To N, To NE, To NW, ...).]
a) Local Communications Interfaces

Operations:
Output (word)          - Broadcast word to all enabled neighbors.
Enable (j)             - Enable neighbor j.
Disable (j)            - Disable neighbor j.
Input (j,word)         - Input word from FIFO j and advance it.
Interrupt enable (j)   - Enable FIFO j non-empty interrupt.
Interrupt disable (j)  - Disable FIFO j non-empty interrupt.
Tests:
Output busy?           - Is any enabled neighbor's FIFO full?
Input ready (j)?       - Is input FIFO j non-empty?
b) Local Communications Primitives

Figure 2: Local Communications for a Single Processor


Figure 3 shows the block diagram and primitive
operations for a processor's global bus
communications.


[Diagram: the processor with an output FIFO holding
(Source, Destination, Data) and an input FIFO holding
(Source, Data), connected through an address detector to the
global bus.]
a) Global Communications Interfaces

Operations:
Set destination (i)  - Set destination register to i.
Send word (w)        - Put (Destination, w) into output FIFO.
Set broadcast        - Set destination register to broadcast address.
Read source (s)      - Get source address from head of input FIFO.
Read data (w)        - Get data and advance input FIFO.
Receive broadcast    - Sensitize detector to broadcast address.
Ignore broadcast     - Disable detection of broadcast address.
Tests:
Input ready?         - Is input FIFO nonempty?
Output full?         - Is output FIFO full?
b) Global Communications Primitives

Figure 3: Global Communications for a Single Processor


The signal flag network consists of eight
single bit variables, or flags, per processor.
One of these and the corresponding flags in all
other processors form the inputs to a network
which forms the combinational functions AND and
OR over all its inputs. The OR and AND functions
are available to each processor and indicate
whether the corresponding flag is set in any
other processor or in all other processors,
respectively. An enable bit allows a flag to be
effectively connected into or isolated from the
computational network. There are eight independent
and nearly identical networks, one for each
of the eight flags available to a processor. Also
associated with each flag is a synchronization
bit, Sync i, which is set when the AND function
becomes true and reset when the OR function
becomes false. The only difference between the
eight flag networks is that one of them, flag
zero, is augmented by a unique selection network
used in some algorithms to solve the "multiple
hit" problem but not needed below. Whenever a
group of enabled processors asynchronously set
this flag, one of them will be designated as the
"first" to do so. The "first" indicator is thus
a separate bit associated with flag zero for each
processor. The structure and primitive operations
for the signal flag network are shown in
Figure 4.


Methods for Performing Array Computations 


The finite element method algorithms proposed
for the multiprocessor array described above
primarily require local computation with input
from neighboring processors, but some calculations
involving numbers from all processors in the array
are also needed. Specifically, the test for
convergence in an iterative algorithm requires the
computation of $\max_i x_i$, where $x_i$ is a value
contained in processor $i$ only and the maximum is
taken over all processors. Using the conjugate
gradient method to perform the iterative linear
equation solution requires the computation of
$\sum_i x_i$, where the sum runs over all
processors. Again $x_i$ is local to a given
processor and in this case it is the product of two
local values. The computation performed is actually
a vector inner product but only the summation
requires calculation involving the whole array.

a) Signal Flag Interfaces
Operations:
Connect (k)     - Enable the kth flag.
Disconnect (k)  - Disable the kth flag.
Set (k)         - Set flag k.
Clear (k)       - Clear flag k.
Tests:
Any (k)?        - Is flag k set in any connected processor?
All (k)?        - Is flag k set in all connected processors?
Sync (k)?       - Was All true previously?
First?          - Was this processor's Set (0) the "first" one?
b) Signal Flag Primitives

Figure 4: Signal Flag Communications for One Processor

Consider then the isolated problem of evaluating
$\max_i x_i$ over the entire array of nodal
processors. At least three distinct methods of
evaluation are possible using different aspects of
the three processor communication networks. First,
all values could be sent over the global bus to a
single processor which would compute the maximum
sequentially. This would require 1024 times the
length of a several instruction loop in a
microprocessor of the array. A possibly faster
method would be to compute maxima over subgroups of
the nodal processors, using the local connection
network to transmit information without conflict
between groups. These subgroup maxima could then be
combined using local or global communications to
form the maximum over all processors, perhaps using
higher level grouping of subgroup results. Call the
first method, using one processor for the maximum
calculation, the central computation method and
call the second method distributed computation.


The third method that presents itself might
be called cooperative computation and operates
using the network of signal flags. This is a
tightly synchronized calculation which uses some
signal flags for synchronization but also uses a
signal flag to compute the maximum value in a bit
serial manner. The computation proceeds roughly
as follows (processor synchronization is not
mentioned explicitly):

1. All processors connect themselves to a signal
   flag, say Flag k, by setting the Enable k bit.
   A counter $\ell$ is initialized to zero.

2. Each processor with its Enable k bit set sets
   its signal Flag k to the $\ell$th bit of $x_i$,
   where the bits are numbered starting with
   zero at the left.

3. Each processor performs the Any (k) test and
   records a 1 in the $\ell$th bit of the maximum if
   Any (k) is true and a 0 if it is false. If
   Any (k) is true each processor which still
   has Enable k set clears it if its Flag k is
   zero.

4. All processors advance to the next bit of $x_i$
   by incrementing $\ell$. If there are more bits,
   the loop is repeated beginning at step 2,
   otherwise step 5 is executed.

5. At this point all processors which still have
   Enable k set (always at least one) are
   associated with an $x_i$ which equals the maximum
   value. All processors have recorded the
   maximum so that the result of the computation is
   already distributed.
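
The five steps above can be simulated sequentially. The sketch below is only an illustration in Python, not the TMS 9900 code that was timed; the function name `cooperative_max`, the list-based model of the Enable and Flag bits, and the `width` parameter are assumptions introduced here.

```python
def cooperative_max(values, width=16):
    """Bit-serial maximum over all 'processors' (hypothetical model).

    values  - one non-negative integer per processor
    width   - word length in bits; bit 0 is the leftmost (most significant)
    Returns (maximum, enabled), where enabled[i] is True for every
    processor whose value equals the maximum.
    """
    enabled = [True] * len(values)          # step 1: all Enable k bits set
    maximum = 0
    for bit in range(width):                # steps 2-4, one pass per bit
        shift = width - 1 - bit
        flags = [(v >> shift) & 1 for v in values]
        # Any (k): OR of Flag k over the processors still connected
        any_k = any(f for f, e in zip(flags, enabled) if e)
        maximum = (maximum << 1) | int(any_k)
        if any_k:
            # processors showing a zero in this position drop out
            enabled = [e and f == 1 for e, f in zip(enabled, flags)]
    return maximum, enabled                 # step 5


maximum, enabled = cooperative_max([5, 12, 7, 12, 3], width=4)
print(maximum, enabled)
```

Every simulated processor ends up holding the same `maximum`, which mirrors the remark in step 5 that the result is already distributed.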


Many versions of the distributed computation of
the maximum are possible, depending on the way in
which the processor array is divided into groups.
We considered what is perhaps the simplest of them,
in which groups of nine processors are formed
initially (so far as possible).

One processor receives values from each of its 
eight neighbors and computes the maximum of the 
nine values to which it has access. These receiv- 
ing processors then combine subgroup maxima in 
pairs using the global bus for communications. 

The pairwise maxima form a binary tree with the 
final array maximum appearing at the root node. 

Of course, an array of 1024 processors is not 
evenly divisible into groups of nine so some of 
the groups are smaller. 


The initial grouping on the array is indicated
in Figure 5. In the first phase of the computation
the processors marked A merely transmit
a value over the local network to a neighbor. The
B processors receive and form the group maximum of
these values. The three types of B processors,
B1, B2 and B3, compute maxima of 9, 6, and 4
values respectively. In the second phase of the
computation, the 121 B processors are formed into
a seven level binary tree with half of the
processors remaining at any one level transmitting
values over the global bus to another processor
and disabling themselves. Thus 61 transmissions
will time multiplex the global bus at the first
tree level and this number will be successively
halved at successive levels. The longest program
will be executed by the processor which eventually
forms the root of the tree and it is this program
which will be used, along with estimates of
overhead due to global bus conflicts, to determine
the speed of the distributed algorithm.

Figure 5: Partitioning the Array for Distributed Computation
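
The group counts quoted above can be checked with a few lines of Python. This is a sketch under an assumption that the text does not state explicitly: the 32 x 32 array is tiled with 3 x 3 blocks starting from one corner, so that the last row and column of blocks are only two processors wide. The function name `group_sizes` is hypothetical.

```python
import math
from collections import Counter


def group_sizes(n=32, g=3):
    """Count groups by size when an n x n array is tiled with g x g blocks."""
    spans = [min(g, n - start) for start in range(0, n, g)]  # 3, 3, ..., 3, 2
    return Counter(a * b for a in spans for b in spans)


sizes = group_sizes()
groups = sum(sizes.values())              # number of B processors
levels = math.ceil(math.log2(groups))     # depth of the pairwise tree
bus_values = groups - 1                   # values sent up the tree
print(dict(sizes), groups, levels, bus_values)
```

Under this tiling the group sizes 9, 6 and 4 correspond to the B1, B2 and B3 processors, and the totals agree with the 121 B processors, seven tree levels and 120 transmitted values used in the timing estimate.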


Comparison of the Methods 


The problem is to decide which of the three
computational methods (centralized, distributed
or cooperative) is most efficient for a 1024
processor version of the finite element machine. To
do this, algorithms of each type were formulated
and coded. A timing analysis of the code was
performed and the overhead due to global bus
congestion was estimated and added to the code time.
Note that the first two computational methods,
centralized and distributed, are also possible
ways of calculating $\sum_i x_i$. The cooperative
method, using a similar bit serial approach, is
possible if one of the signal flag networks is
augmented with a parallel counter. This would
appear to a processor as an 11 bit port, Count (k),
which gives the number of processors having Enable k
and Flag k both set. We will take the analysis for
$\max_i x_i$ as indicative and use the results to
determine whether the addition of a parallel counter
is cost effective for the computation of
$\sum_i x_i$.


The algorithms for central, distributed and
cooperative calculation of $\max_i x_i$ were timed
and compared using hardware parameters from the
prototype Finite Element Machine designed at the
University of Colorado and under construction at
NASA Langley Research Center. The microprocessor
used is a Texas Instruments TMS 9900 with an
average instruction time of about 6 microseconds.
The time for one transmission over the global bus
is taken as 0.5 microseconds. The time for the
central computation is determined by the algorithm
executed by the single central receiver. This
algorithm is a straightforward access and
combination of 1024 values. The TMS 9900 program
required the execution of 9,219 instructions so
that the time for evaluating the maximum would be
about 55.3 milliseconds. Since a global bus
transmission time is shorter than an instruction
time, no transmission overhead need be added to
this number.

Begin distributed calculation of maximum

Phase 1:
  Set maximum to this processor's value.
  Initialize to first of 8 neighbors.
  Neighbor value available? (No: wait.)
  Replace maximum by neighbor value if the latter is larger.
  Advance to next neighbor.
  All neighbor values done? (No: return to the availability test.)

Phase 2:
  Initialize number of group maxima received to zero.
  Input available from global bus? (No: wait.)
  Read the value and replace maximum with it if
    the new value is larger.
  Count a group maximum received for this level
    of the binary tree.
  All 7 tree levels complete? (No: return to the input test.)
  Broadcast overall maximum over global bus.

End distributed calculation of maximum

Figure 6: Distributed Calculation of $\max_i x_i$


In the specific distributed calculation described
above, there are distinct algorithms executed
by distinct groups of processors. Some of
them compute maxima over values from their eight
neighboring processors and themselves, transmit
this local maximum and then wait for a broadcast
message containing the overall maximum. This
corresponds to the section marked Phase 1 in the
flowchart of Figure 6. In Phase 2, one of the
processors involved in computing the pairwise
maxima will receive values at all seven levels of
the binary tree and broadcast the final maximum
back to all processors. It is not absolutely
necessary that the same processor execute both
Phase 1 and Phase 2 of the flowchart, but the
overall time will be determined by the juxtaposition
of these two program segments in any case.
The program code amounts to 175 instructions
executed from start to finish or about 1050
microseconds. Added to this will be non-overlapped
communications time. The first local network
transmission requires 20 microseconds. Subsequent
local transmissions overlap with instruction
execution. The non-overlapped global bus
transmission time is harder to estimate but it is
certainly bounded above by the total number of
values transmitted, 120, times the bus transmission
time. This gives an upper bound of 60 microseconds,
for a total of 1.13 milliseconds for the
distributed algorithm.


The code for the cooperative algorithm fol- 
lows the flowchart of Figure 7. The explicitly 
specified synchronization is an example of so- 
called barrier synchronization in which all pro- 
cessors must reach a given point before any may 
proceed. Here it serves to assure that all pro- 
cessors have input a bit into the Flag 1 network 
before the OR of all the bits is tested. With 
all processors running at the same speed the over- 
head due to synchronizing waits will be zero and 
can thus be neglected even if the processors only 
run at nearly the same speed. The number of in- 
structions executed in this algorithm is 328 for 
a time of about 1.97 milliseconds. 
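
The three timing estimates follow directly from the instruction counts and hardware parameters quoted in this section. The short Python sketch below only reproduces that arithmetic; the variable names are ours.

```python
INSTR_US = 6.0   # average TMS 9900 instruction time, microseconds
BUS_US = 0.5     # one global bus transmission, microseconds

# Central: 9,219 instructions, bus time fully overlapped.
central_ms = 9219 * INSTR_US / 1000

# Distributed: 175 instructions, plus the first local transmission
# (20 microseconds) and at most 120 non-overlapped bus transmissions.
distributed_ms = (175 * INSTR_US + 20 + 120 * BUS_US) / 1000

# Cooperative: 328 instructions, synchronization waits neglected.
cooperative_ms = 328 * INSTR_US / 1000

print(central_ms, distributed_ms, cooperative_ms)
print(central_ms / distributed_ms, cooperative_ms / distributed_ms)
```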


As expected, the central computation is the
clear loser, taking 49 times as long as the
distributed calculation. Somewhat unexpected,
however, is the result that the crude distributed
algorithm which we analyzed is almost twice as fast
as the cooperative method. This gain comes at the
expense of having to load different programs into
different groups of processors while in the
cooperative method one program serves for all
processors. Certainly the comparison indicates that
the addition of a parallel counter to support
cooperative calculation of $\sum_i x_i$ would not
be cost effective. On the other hand, the fastest
method takes a millisecond, which is quite long with
respect to the processor speed, so hardware support
for both $\max_i x_i$ and $\sum_i x_i$ is indicated.

Begin cooperative calculation of maximum

 1. Initialize the bit counter and the maximum value both to zero.
 2. Enable Flag 1 for computation and Flag 0 for synchronization.
 3. Have all processors finished the last synchronization,
    i.e., is Any (0) clear? (No: wait.)
 4. Working from left to right, set the current bit of this
    processor's local value into Flag 1.
 5. Set Flag (0) to indicate arrival at synchronization point.
 6. Have all processors arrived at synchronization point,
    i.e., is Sync (0) set? (No: wait.)
 7. Clear Flag (0) indicating completion of synchronization.
 8. Does any processor's value have a one in this bit position,
    i.e., is Any (1) set? (No: go to step 10.)
 9. Set current bit of maximum to one. If current bit of local
    value is zero, disable Flag 1 so local value no longer
    influences maximum bit.
10. Advance to next bit of maximum and of local value.
11. Have 16 bits been processed? (No: go to step 3.)

End cooperative calculation of maximum.

Figure 7: Cooperative Calculation of $\max_i x_i$
Hardware Support for Global Calculations


As a result of the above analysis a hardware 
circuit has been included in the finite-element 
machine to avoid long delays in calculating global 
sum and global maximum. The configuration of this 
circuit is a binary tree, in which each tree ele- 
ment accepts an argument pair and passes on the 
result to the next higher level. The global re- 
sults, which appear at the apex of the tree, are 
then broadcast to all the processors. With this
configuration, a tree of N levels can serve $2^N$
processors. For instance, 1024 processors require
10 levels and 1023 tree elements.


The tree elements are serial. Each one 
accepts two serial arguments and produces a serial 
result, delayed by one bit period. The sum and 
maximum functions are calculated alternately, 
sharing the same tree connections and element cir- 
cuitry. The entire calculation sequence involves 
a serial frame of 48 bit periods: 26 for sum, 16 
for maximum and 6 unused. 


The tree input data consists of a single 16- 
bit output register at each processor. Both func- 
tions are thus calculated from the same data, and 
they are separately available. The sum requires 
a double-word input register for the entire re- 
sult; the maximum requires a single-word register. 
In both cases, the data are treated as positive, 
integer values. 


The sum-maximum tree is implemented without 
facilities for process synchronization. Use of 
the circuit will generally require barrier syn- 
chronization, such as that already provided by 
the signal flags. However, functional synchroni- 
zation is included, whereby each output register 
is sampled as a unit. This assures that the re- 
sults are valid even while different processors 
are updating their output registers. 


The delay for the circuit consists of the 
tree delay of one bit period for each level, plus 
the frame delay of 48 bit periods. The delay for 
1024 processors, which requires 10 levels in the 
tree, is thus 58 bit periods. With a bit period 
of one microsecond, the entire calculation of both 
sum and maximum is accomplished in 58 microseconds. 
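
The tree organization and the delay figure can be illustrated with a small functional model. The Python sketch below combines values pairwise, level by level; it ignores the bit-serial timing inside each element and assumes the number of processors is a power of two. The function name `tree_sum_max` is hypothetical.

```python
def tree_sum_max(values):
    """Pairwise reduction of sum and maximum, as in a binary tree.

    Returns (sum, maximum, levels). Assumes len(values) is a power of two.
    """
    sums, maxes, levels = list(values), list(values), 0
    while len(sums) > 1:
        sums = [sums[i] + sums[i + 1] for i in range(0, len(sums), 2)]
        maxes = [max(maxes[i], maxes[i + 1]) for i in range(0, len(maxes), 2)]
        levels += 1
    return sums[0], maxes[0], levels


FRAME_BITS = 48        # 26 for sum, 16 for maximum, 6 unused
BIT_PERIOD_US = 1.0

# Worst case for the sum: every processor holds the largest 16-bit value.
total, largest, levels = tree_sum_max([0xFFFF] * 1024)
delay_us = (levels + FRAME_BITS) * BIT_PERIOD_US
print(levels, delay_us, total.bit_length(), largest)
```

The worst-case sum of 1024 sixteen-bit operands needs 26 bits, which matches the 26 bit periods reserved for the sum in the frame.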


Conclusion 


The detailed timing analysis for three algorithms
for computing $\sum_i x_i$ and $\max_i x_i$ over all
processors in a multiprocessor array has been
performed. The three algorithms make distinctly
different use of the communication pathways in
the machine. The analysis shows that $\max_i x_i$
can be computed in 55.3 milliseconds by a
centralized algorithm, in 1.13 milliseconds by a
distributed algorithm and in 1.97 milliseconds by a
cooperative algorithm. The distributed algorithm
is the most logically complex, requiring about 6
different programs for different groups of
processors.


As a result of this analysis a proposal to
support cooperative calculation of $\sum_i x_i$ with
special hardware in the Finite Element Machine
was discarded. Instead a complete hardware unit
to calculate sum and maximum using a bit serial
binary tree organization was designed. The unit
can calculate both sum and maximum over the same
set of 1024 operands in 58 microseconds.
ON THE MAPPING PROBLEM
MS 132C, NASA Langley Research Center, Hampton, VA 23665


Abstract--In array processors it is 
important to map problem modules onto processors 
such that modules that communicate with each 
other lie, as far as possible, on adjacent 
processors. This mapping problem is formulated 
in graph theoretic terms and shown to be 
equivalent, in its most general form, to the 
graph isomorphism problem. The problem is also 
very similar to the bandwidth reduction problem 
for sparse matrices and to the quadratic 
assignment problem. 


It appears unlikely that an efficient exact 
algorithm for the general mapping problem will 
ever be found. Research in this area must 
concentrate on efficient heuristics that find 
good solutions in most cases. A heuristic 
algorithm that proceeds by sequences of pairwise 
interchanges alternating with probabilistic jumps 
is described. This algorithm has been used to
solve practical mapping problems on a specific 
array processor (the Finite Element Machine) with 
good results. Results for a set of practical 
problems are tabulated, several of which are 
illustrated. 


I. Introduction 

Most arrays of processors are incompletely 
connected, that is, a direct link does not 
connect each pair of processors. The reasons for 
this include (1) the fact that the total number 
of links in completely connected systems 
increases as the square of the number of 
processors--a growth rate that is unacceptable in
most cases, and (2) the number of input/output
ports on each individual processor increases
linearly with the number of processors--this is
usually not possible because the number of I/O
ports is generally fixed at some constant value.


Suppose a problem made up of several modules 
that execute in parallel is to be solved on an 
incompletely connected array. When assigning 
modules to processors, pairs of modules that 
communicate with each other should be placed, as 
far as possible, on processors that are directly 
connected. We call the assignment of modules to 
processors a mapping and the problem of 
maximizing the number of pairs of communicating 
modules that fall on pairs of directly connected 


processors the Mapping Problem. 


This research was supported by NASA Contracts
NAS1-14101 and NAS1-14472 while the author was
resident at ICASE.
In this paper we first show that the problem
of finding the best mapping is, in general, very 
difficult. We then describe a heuristic 
algorithm that has been developed to solve this 
problem for a specific array processor. We start 
by giving a mathematical formulation of the 
problem in Section II. In Section III we show 
that in its most general form, the mapping 
problem is equivalent to the graph isomorphism 
problem, one of the classical unsolved 
combinatorial problems. We point out the 
similarities between the mapping problem and the 
bandwidth reduction and quadratic assignment 
problems. Efficient exact algorithms are known for
neither of these problems and they are solved
approximately using heuristic algorithms.


In Section IV we describe how the mapping 
problem arises when solving structural problems 
on the Finite Element Machine (FEM), an array of 
processors currently under development at NASA 
Langley Research Center. In Section V we 
describe a simple heuristic algorithm that has 
been implemented and used to find mappings for 
the finite element machine with very encouraging 
results. Results for a number of test cases are 
tabulated, several of which are illustrated. 


II. Mathematical Formulation 
Let the graph of the problem to be mapped
onto the array be denoted $G_p = \langle V_p, E_p \rangle$,
where the nodes or vertices $V_p$ correspond to the
set of modules and each edge $(x,y) \in E_p$ denotes
that modules $x, y \in V_p$ communicate with each
other.

Let the graph of the array processor be
denoted $G_a = \langle V_a, E_a \rangle$, where $V_a$ is
the set of processors and the edges $E_a$ represent
the interconnection pattern of the processors.

The problem graph $G_p$ may be considered to
be a set of vertices $V_p$ and a function
$G_p : V_p \times V_p \to \{0,1\}$ such that
$G_p(x,y) = G_p(y,x)$ and $G_p(x,x) = 0$ for all
$x, y \in V_p$. $G_p(x,y) = 1$ is taken to mean that
there is an edge between $x$ and $y$, i.e. the pair
$(x,y) \in E_p$.

The graph of the array, $G_a$, may similarly
be considered a set of vertices $V_a$ and a
function $G_a : V_a \times V_a \to \{0,1\}$.

We assume that $|V_p| = |V_a|$. If
$|V_p| < |V_a|$, a suitable number of dummy
vertices may be inserted into the problem. We do
not consider the case $|V_p| > |V_a|$.

A mapping of problem modules onto processors
is denoted by the function $f_m : V_p \to V_a$, which
is one-to-one and onto.

The quality of a mapping is determined by
the number of problem edges that fall on array
edges. We call this number the cardinality of
the mapping, denoted $|f_m|$.

The cardinality of a mapping $f_m$ is

$$|f_m| = \frac{1}{2} \sum_{x \in V_p} \sum_{y \in V_p} G_p(x,y) \cdot G_a(f_m(x), f_m(y)).$$

This formula arises as follows.
$G_p(x,y) = 1$ if $x$ and $y$ in the problem
graph are connected by an edge. $f_m(x)$
[$f_m(y)$] represents the processor onto which
problem module $x$ [$y$] is mapped. The expression
$G_a(f_m(x), f_m(y)) = 1$ only if the processors
onto which $x$ and $y$ are mapped are connected.
Thus the expression inside the summation sign is
1 only if an edge connecting two modules falls on
an edge connecting two processors. In summing
over all $x \in V_p$ and $y \in V_p$ each processor edge
is counted twice, hence the multiplying factor.

To find the best mapping, we must choose a
function $f_m$ that has maximum cardinality from
among the $(|V_p|)!$ possible functions.
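
The cardinality formula translates directly into code. The following Python sketch is an illustration only: the 0/1 matrices `G_p` and `G_a`, the list `f` (where `f[x]` is the processor assigned to module `x`) and the small example graphs are assumptions made here, not data from the paper.

```python
def cardinality(G_p, G_a, f):
    """|f| = 1/2 * sum over x, y of G_p[x][y] * G_a[f[x]][f[y]]."""
    n = len(G_p)
    total = sum(G_p[x][y] * G_a[f[x]][f[y]]
                for x in range(n) for y in range(n))
    return total // 2   # every edge was counted twice


def from_edges(n, edges):
    """Build a symmetric 0/1 adjacency matrix with a zero diagonal."""
    G = [[0] * n for _ in range(n)]
    for a, b in edges:
        G[a][b] = G[b][a] = 1
    return G


# Problem graph: a chain of four modules. Array: four processors in a ring.
G_p = from_edges(4, [(0, 1), (1, 2), (2, 3)])
G_a = from_edges(4, [(0, 1), (1, 2), (2, 3), (3, 0)])

print(cardinality(G_p, G_a, [0, 1, 2, 3]))  # all three problem edges on links
print(cardinality(G_p, G_a, [0, 2, 1, 3]))  # only edge (1,2) falls on a link
```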


III. Problem Equivalences 


In this section we show that the mapping
problem, in its most general form (i.e. given
arbitrary $G_p$ and $G_a$), is computationally
equivalent to the graph isomorphism problem. We
point out the strong similarities between the
mapping problem and the bandwidth reduction and
quadratic assignment problems.
Graph Isomorphism


Two graphs $G_1$ and $G_2$ are said to be
isomorphic to each other if there is a one-to-one
correspondence between their vertices and between
their edges such that the incidence relationships
are preserved [1]. This may be stated more
formally as follows: two graphs
$G_1 : V_1 \times V_1 \to \{0,1\}$ and
$G_2 : V_2 \times V_2 \to \{0,1\}$ with $|V_1| = |V_2|$ are
isomorphic if there exists a one-to-one and onto
function $e : V_1 \to V_2$ such that
$G_1(x,y) = G_2(e(x), e(y))$ for all $x, y \in V_1$ [2].


The problem of determining whether two 
graphs are isomorphic is one of the classical 
unsolved combinatorial problems. Exact efficient 
(i.e. polynomial time) algorithms for solving 
this problem for arbitrary graphs are not known 
although numerous researchers have attacked this 
problem [3]. Some researchers have reported 
success with heuristic algorithms applied to 
various restricted classes of graphs (see, for
example [4]). It appears unlikely that a
polynomial time solution to the general problem
will ever be found.


We now show that if we had an exact 
algorithm for solving the general mapping 
problem, we would also be able to solve the graph 
isomorphism problem. 


If two graphs are isomorphic, they must have
the same number of edges and thus a function
mapping $G_1$ onto $G_2$ and having cardinality
equal to the total number of edges must exist.
If we had an exact algorithm for solving the
mapping problem, we could use it to map $G_1$ onto
$G_2$ and, if the two were isomorphic, obtain a
mapping of cardinality equal to the number of
edges. Thus we could answer "yes" or "no" to the
question "Are $G_1$ and $G_2$ isomorphic?" in
polynomial time, for arbitrary $G_1$ and $G_2$, if
we could solve the mapping problem in polynomial
time for arbitrary $G_1$ and $G_2$. The mapping
problem is therefore computationally equivalent
to the graph isomorphism problem and we do not
hold much hope for finding an exact polynomial
time algorithm for its solution.
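
The argument above can be made concrete with an exhaustive "exact" mapper, usable only on very small graphs. The Python sketch below is an illustration, not an algorithm from this paper: `max_cardinality` simply tries every one-to-one mapping, and `isomorphic` applies the test described in the text. Both names, and the helper `from_edges`, are hypothetical.

```python
from itertools import permutations


def from_edges(n, edges):
    """Build a symmetric 0/1 adjacency matrix with a zero diagonal."""
    G = [[0] * n for _ in range(n)]
    for a, b in edges:
        G[a][b] = G[b][a] = 1
    return G


def edge_count(G):
    return sum(map(sum, G)) // 2


def max_cardinality(G1, G2):
    """Best cardinality over all |V|! mappings (exponential time)."""
    n = len(G1)
    return max(
        sum(G1[x][y] * G2[f[x]][f[y]] for x in range(n) for y in range(n)) // 2
        for f in permutations(range(n))
    )


def isomorphic(G1, G2):
    """Isomorphic iff edge counts agree and some mapping covers every edge."""
    if len(G1) != len(G2) or edge_count(G1) != edge_count(G2):
        return False
    return max_cardinality(G1, G2) == edge_count(G1)


chain = from_edges(4, [(0, 1), (1, 2), (2, 3)])
relabeled_chain = from_edges(4, [(2, 0), (0, 3), (3, 1)])
star = from_edges(4, [(0, 1), (0, 2), (0, 3)])
print(isomorphic(chain, relabeled_chain), isomorphic(chain, star))
```

The chain and the star have the same number of edges, yet no mapping places all three chain edges on star edges, so the test correctly reports that they are not isomorphic.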


Bandwidth Reduction 


The bandwidth reduction problem requires the 
permutation of the rows and columns of a sparse 
Square matrix so as to cluster the non-zero 
entries as closely as possible about the main 
diagonal [5]. The mapping problem, as will 
become clear in the following sections, entails 
permuting the rows and columns of the adjacency 
matrix of a problem graph so that it resembles as 
closely as possible the adjacency matrix of the 
graph of the array of processors. Arrays of 
processors that have a regular interconnection 
pattern (as does the FEM), usually have an 
adjacency matrix composed mostly of several 
well-defined bands. The mapping problem for such 
arrays entails permuting the input matrix so that 
as many entries as possible fall on the bands. 
The similarity with bandwidth reduction is 
obvious. 


The bandwidth reduction problem is known to
be NP-complete [6]. Many heuristic algorithms for
this problem have been developed [5],[7].


The Quadratic Assignment Problem


In this problem we are given (1) a set of n
objects along with a cost matrix in which each
entry $c_{ij}$ is a measure of the affinity between
objects i and j, and (2) a set of n locations
with a distance matrix in which entry $d_{st}$
stands for the distance between locations s and
t. A function p that maps objects onto locations
is called an assignment. The problem of finding
the assignment that minimizes
$\sum_{i,j} c_{ij} d_{p(i)p(j)}$ is called the
quadratic assignment problem [8]. This problem
is exemplified by the task of locating
electrical assemblies in given slots so as to
minimize the total length of interconnecting
wires. No efficient algorithm for the solution
of this problem is known.


If the affinity and distance matrices are
symmetric and have 0,1 entries, the quadratic
assignment problem reduces to the mapping
problem.


IV. The Finite Element Machine

The Finite Element Machine (FEM), presently
under development at NASA Langley Research
Center, is an array of microcomputers
interconnected in an "eight-nearest neighbor"
interconnection pattern (Fig. 1) [9],[10]. In
addition to the nearest neighbor links, which are
dedicated to communication between specific pairs
of processors, there is a time shared global bus
(not shown in Fig. 1) which is used for
communication between pairs of nodes that are not
adjacent.


The machine is to be used to solve
structural analysis problems as follows. The 
structure is first reduced to a combinatorial 
graph. The edges of the graph correspond to 
structural members and the nodes to meeting 
points of the members (Fig. 2). Each node is 
assigned to a processor of the FEM and 
computation proceeds in parallel, as described by 
Jordan [9]. During the course of the 
computation, there is communication between pairs 
of processors only if the structural nodes mapped 
on them are connected in the physical problem. 
Thus, should an edge of the physical problem fall 
on an edge of the FEM, communication proceeds 
with greatest efficiency via the dedicated 
nearest neighbor connection. Should this not be 
the case, interprocessor communication must 
employ the time-shared global bus with consequent 
degradation in performance. 


Mapping Structures onto the FEM 


Fig. 3 shows the adjacency matrix of a 6 X 6 
FEM. One possible way of mapping the problem 
structure of Fig. 2 onto the FEM would be to map 
node i of the structure onto node i of the FEM. 
This mapping is indicated in the adjacency matrix
of the structure (Fig. 4) by the use of '*'s
where an edge of the structure falls on an edge
of the FEM, and '0's otherwise. The cardinality
of this mapping is 32, while the total number of
edges is 80.


We can attempt to increase the cardinality 
of the mapping by renumbering the nodes of the 
problem or, equivalently, permuting rows and 
columns of the adjacency matrix of the problem. 
The bottom part of Fig. 4 shows an improved 
mapping, with cardinality 74, obtained by 
applying the mapping algorithm that will be 
described in the following section. The
permuted row and column labels indicate the 
renumbering that must be done to the nodes of the 


problem in order to obtain this improved mapping. 


V. MAPPER: A Pairwise Interchange Algorithm 


We have developed a heuristic algorithm that 
accepts as input the adjacency matrix of a 
problem graph and outputs a permutation of this 
matrix that matches more closely the adjacency 
matrix of the FEM. 


The algorithm proceeds by sequences of 
pairwise interchanges, alternating with 
probabilistic jumps. It starts by accepting the 
problem matrix and the size of the square FEM 
onto which it is to be mapped. It then generates 
the adjacency matrix of the FEM and uses this for 
comparison while improving the mapping. 


Most of the following listing is 
self-explanatory. The function CARDINALITY(MAT) 
returns the cardinality of the mapping defined by 
the matrix MAT. 


program MAPPER; 
var MAT, BEST: adjacency matrix; 
    DONE, FLAG: boolean; 
begin 
   input adjacency matrix of problem, MAT; 
      {MAT is taken to be the initial mapping} 
   input the size of the FEM, n; 
      {the FEM is an n X n array} 
   generate adjacency matrix for n X n FEM; 
   BEST := MAT; {the best mapping found so far} 
   DONE := false; 
   while not DONE do 
   begin {MAIN} 

      repeat {SEARCH} 
         FLAG := false; 
         for each node do 
         begin {AUGMENT} 

            1: examine the pairwise 
               exchange of this node 
               with all other nodes; 

            2: select the one which leads 
               to the largest gain in the 
               cardinality of the mapping; 

            3: if largest gain >= 0 then 
                  make the exchange; 

            4: if largest gain > 0 then 
                  FLAG := true; 

         end; {AUGMENT} 
      until FLAG = false; {end SEARCH} 

      if CARDINALITY(MAT) < CARDINALITY(BEST) 
         then DONE := true 
      else 
      begin {JUMP} 
         BEST := MAT; 
         randomly interchange 
            n pairs of nodes of MAT; 
      end; {JUMP} 

   end; {MAIN} 
   output BEST; 
end. 


The block SEARCH of this algorithm attempts 
to improve the mapping by considering all 
possible pairwise exchanges of node numberings. 
The exchange that leads to the maximum increase 
in cardinality of the mapping is made and the 
process AUGMENT repeated until no further gains 
are possible. At this point we leave SEARCH and 
if the mapping found during this pass through 
SEARCH is better than the best mapping found so 
far, a probabilistic jump is applied to the 
mapping and the algorithm returns to block 
SEARCH. 


Pairwise interchanges are not guaranteed to 
lead to the best mapping and sometimes lead to 
mappings that are "dead ends" in that they are 
not very close to optimal and no pairwise 
exchange can improve them. The algorithm 
attempts to leave such dead ends by 
probabilistically "jumping" to nearby mappings 
that may permit improvement via pairwise 
interchanges. 
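The listing and the description above can be turned into a small runnable sketch. It is an interpretation, not the authors' implementation: it represents a mapping as a permutation `perm` (problem node i sits on FEM node `perm[i]`) instead of permuting the matrix, all function names are hypothetical, and it stops as soon as a pass fails to strictly improve on the best mapping (the listing tests with a strict "<"), an assumption made here so that termination is guaranteed.

```python
import random


def fem_adjacency(n):
    """Adjacency matrix of an n x n nearest-neighbour array."""
    size = n * n
    adj = [[0] * size for _ in range(size)]
    for r in range(n):
        for c in range(n):
            i = r * n + c
            if c + 1 < n:
                adj[i][i + 1] = adj[i + 1][i] = 1
            if r + 1 < n:
                adj[i][i + n] = adj[i + n][i] = 1
    return adj


def cardinality(problem, fem, perm):
    """Matrix entries where a problem edge lands on a FEM edge."""
    size = len(perm)
    return sum(1 for i in range(size) for j in range(size)
               if problem[i][j] and fem[perm[i]][perm[j]])


def search(problem, fem, perm):
    """Block SEARCH: repeat AUGMENT for every node until no strict gain."""
    size = len(perm)
    flag = True
    while flag:
        flag = False
        for a in range(size):                       # block AUGMENT
            base = cardinality(problem, fem, perm)
            best_gain, best_b = None, None
            for b in range(size):
                if b == a:
                    continue
                perm[a], perm[b] = perm[b], perm[a]     # trial exchange
                gain = cardinality(problem, fem, perm) - base
                perm[a], perm[b] = perm[b], perm[a]     # undo it
                if best_gain is None or gain > best_gain:
                    best_gain, best_b = gain, b
            if best_gain is not None and best_gain >= 0:
                perm[a], perm[best_b] = perm[best_b], perm[a]
                if best_gain > 0:
                    flag = True


def mapper(problem, n, seed=0):
    """Return (best permutation, its cardinality) for an n x n FEM."""
    rng = random.Random(seed)
    fem = fem_adjacency(n)
    size = n * n
    perm = list(range(size))
    best = perm[:]
    while True:
        search(problem, fem, perm)
        # Assumption: stop unless this pass strictly beats the best so far.
        if cardinality(problem, fem, perm) <= cardinality(problem, fem, best):
            break
        best = perm[:]
        for _ in range(n):                          # block JUMP
            a, b = rng.sample(range(size), 2)
            perm[a], perm[b] = perm[b], perm[a]
    return best, cardinality(problem, fem, best)
```

The sketch recomputes the full cardinality for every trial exchange, which is simple but slow; an incremental gain computation would be the natural refinement for anything larger than toy sizes.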




The following is a detailed discussion of 
various aspects of the algorithm. 


1. When carrying out pairwise exchanges, we 
choose for each node the exchange that leads to 
the largest gain in cardinality rather than the 
first gainful exchange encountered. We have 
found that this strategy leads to mappings that 
are consistently better than those obtained using 
the second criterion. 


2. We make an exchange even if the largest gain 
encountered is zero. This has little effect at 
the outset, when interchanges with nonzero gains 
are easily found. Towards the end of the 
algorithm, this criterion helps slide past "dead 
ends" to mappings which, although they have the 
same cardinality, may permit further improvement. 


3. If the number of nodes on the FEM is N = n X n, 
then the execution of block AUGMENT will take 
O(N^2) time. 


4. The algorithm will exit block SEARCH if no 
pairwise exchange leads to an improvement. If 
the cardinality of the mapping found during this 
pass through SEARCH is better than the one found 
during the last pass, the algorithm executes 
JUMP. Here it tries to break out of the "dead 
end" from which no pairwise exchange leads to an 
improvement by probabilistically jumping to a 
nearby mapping, which, although it will almost 
certainly have poorer cardinality, may lead to a 
better mapping upon further application of block 
SEARCH. A copy of the old mapping is saved in 
BEST, in case the new mapping is poorer. 


5. The probabilistic jump described above needs 
to be far enough from the current mapping to 
offer the prospect of improvement, but not so far 
as to undo all the gains made up to this point. 
We have found that an interchange of n randomly 
selected pairs of nodes gives the best results. 


6. If the mapping found after attempting to 
augment from a probabilistically disturbed 
mapping is poorer, the algorithm terminates. We 
have found that further probabilistic jumps very 
rarely lead to improvements. 


7. For an N node FEM, the cardinality of a 
mapping cannot exceed 4N. Each pass through loop 
MAIN must lead to a gain of at least 1. The time 
required to execute this loop is dominated by 
AUGMENT, which takes O(N^2) time. The algorithm 
thus takes O(N^3) time in all. 
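The bound in item 7 can be written out step by step. This is only a restatement of the argument above, using the paper's own quantities (at most 4N passes through MAIN, each costing what AUGMENT costs) together with N = n^2 for an n X n array:

```latex
\begin{align*}
\text{passes through MAIN} &\le 4N, \\
\text{cost of one pass}    &= O(N^2), \\
\text{total time}          &\le 4N \cdot O(N^2) = O(N^3) = O(n^6)
                              \qquad (N = n^2).
\end{align*}
```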


VI. Performance of the Algorithm 


The algorithm has been implemented and 
tested on about 20 structural problems of 9 to 49 
nodes for FEMs of sizes 4 X 4 to 7 X 7. The 
results are tabulated in Table I. Some of these 
cases are illustrated in Figs. 5-7. In most 
cases the algorithm is able to improve the 
mapping dramatically. 


It is difficult to say just how close to 
optimal the mappings obtained by the algorithm 
are, since we have no way of knowing what the 
best mapping for a specific problem is. In cases 
where the cardinality of the final mapping is 
close to the total number of edges, we can be 
sure that it is very close to optimal. For 
example, the mapping of Fig. 2 was improved from 
32 to 74 as shown in Fig. 4. The cardinality of 
the final mapping is very close to the total 
number of edges (80) and must therefore be very 
near optimal. (For this specific example, it is 
possible to prove that the optimal mapping is of 
cardinality 78 [11].) In general, we have found 
that graphs whose input mappings have 
cardinalities of around 50 percent of the total 
number of edges or less can usually be improved 
dramatically. 


To get a better idea of the performance of 
the algorithm, we mapped random permutations of 
the FEM onto itself. Since the FEM maps 
perfectly onto itself, the success of the 
algorithm in doing so gives us some idea of how 
well it performs on general problems, for which 
it is impossible to specify the cardinality of 
the best mapping. The results of this 
experiment, in which we fed the algorithm 100 
random permutations of 5 X 5 and 6 X 6 FEMs, are 
listed in Fig. 8. Histograms for the initial 
cardinality, the cardinality at the end of the 
first application of SEARCH and the cardinality 
at the termination of the algorithm are given 
(the difference between the latter two 
illustrates the impact of jumping). The 
algorithm is seen to perform very well in these 
experiments, suggesting that the results obtained 
when the algorithm is run on natural structural 
problems are of similar quality. 


The run times of the implemented algorithm 
on a CDC Cyber 175 vary from about 1/3 sec. for 
4 X 4 problems to 30 sec. for 7 X 7 problems. 


VII. Conclusions 

The observed run times of our algorithm are 
quite acceptable for the current prototype 6 X 6 
FEM. However, as the growth rate of time is 
bounded from below by n^2, the algorithm will 
probably not be suitable for very large arrays 
(say 32 X 32). For such arrays, entirely 
different heuristics will need to be developed. 
PROBLEM NAME                   NODES   FEM SIZE 

STRUCTURE WITH FREE NODES      33      6 X 6 
TRUSS                          8       4 X 4 
                                       5 X 5 
                                       6 X 6 
                                       7 X 7 
SHIP RADAR TOWER               25      5 X 5 
SCHWEDLER DOME                 6       4 X 4 
+WING BOX                      30      6 X 6 
*NTF-DOWNSTREAM NACELLE        35      6 X 6 
NTF-NACELLE GUSSET PLATE       39      7 X 7 
NTF-DOWNSTREAM NACELLE         28      6 X 6 
NTF-NACELLE BULKHEAD           30      6 X 6 
NTF-CRADLE                     33      6 X 6 


+The algorithm was unable to improve the given mapping. 
MPP - A MASSIVELY PARALLEL PROCESSOR 


Kenneth Batcher 
Digital Technology Department 
Goodyear Aerospace Corporation 
Akron, Ohio 44315 


Summary 


The processing of satellite imagery requires 
very fast two-dimensional data processors. 
For example, the Landsat-D satellite will 
transmit about a million picture elements 
(pixels) per second to the ground (on the 
average). From 100 to 10,000 operations per 
pixel are required, so this satellite alone 
requires a computer that can perform 100 
million to 10 billion operations per second. 
The Massively Parallel Processor (MPP) is 
designed to process two-dimensional data such 
as satellite imagery at high speed. 


The main feature of MPP is the large 
number of processing elements: 16,384 PE's. 
Since a typical satellite image contains 
millions of pixels, it is not hard to keep 
such a large number of PE's busy performing 
useful work. The PE's are arranged in a 
128 x 128 square array with each PE connected 
to its nearest neighbors on the north, east, 
south and west. For reliability purposes, an 
extra 128 x 4 rectangle of PE's is added to 
the array unit (ARU) and bypass switches are 
added to the routing network so the failure 
of any PE can be corrected by simply bypassing 
the 128 x 4 group containing it. 


The length of data items being processed 
varies widely. Each spectral band of a 
pixel has a length in the range of 6 to 12 
bits. Intermediate results range from 6 bits 
to over 30 bits. Flags are only 1 bit long. 
Some items may be in the floating-point for- 
mat of an attached host computer. The PE's 
are bit-serial processors so they can accom- 
modate data of any length without waste. 


Each PE has a serial adder, a variable- 
length shift register, a routing-and-logic 
section, a mask section, an I/O-section and 
a 1024-bit RAM. The basic cycle time is 100 
nanoseconds. Addition, subtraction and 
searching of integer data occur with no 
overhead added to the time required to read 
and write RAM data. Thus, to add two 8-bit 
arrays and create a 9-bit sum array requires 
25 cycles (2.5 microseconds). Multiplication, 
division and floating-point operations 
use the shift register. The mask section of 
each PE is used to turn PE's on and off when 
required by the application program. The 
I/O-section shifts input and output data 
across the array (west-to-east), while other 
data are processed. 
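The 25-cycle figure can be checked with a small bit-plane simulation. This is a sketch under stated assumptions, not a description of the MPP microcode: it assumes unsigned operands, one memory cycle per bit-plane read or write, and hypothetical helper names (`to_planes`, `from_planes`, `bit_serial_add`).

```python
def to_planes(values, width):
    """Split one integer per PE into bit planes, least significant first."""
    return [[(v >> k) & 1 for v in values] for k in range(width)]


def from_planes(planes):
    """Reassemble one integer per PE from its bit planes."""
    pes = len(planes[0])
    return [sum(planes[k][i] << k for k in range(len(planes)))
            for i in range(pes)]


def bit_serial_add(a_planes, b_planes):
    """Add two arrays plane by plane; return (sum planes, memory cycles)."""
    pes = len(a_planes[0])
    carry = [0] * pes
    sum_planes = []
    cycles = 0
    for a, b in zip(a_planes, b_planes):
        cycles += 2                                   # read one plane of each operand
        plane = [a[i] ^ b[i] ^ carry[i] for i in range(pes)]
        carry = [(a[i] & b[i]) | (carry[i] & (a[i] ^ b[i]))
                 for i in range(pes)]
        sum_planes.append(plane)
        cycles += 1                                   # write one sum plane
    sum_planes.append(carry)                          # final carry is the top sum bit
    cycles += 1                                       # write it
    return sum_planes, cycles
```

For 8-bit operands the loop performs 16 reads and 8 writes, plus one write for the carry plane, giving the 25 cycles quoted above. At 100 nanoseconds per cycle that is 2.5 microseconds for 16,384 simultaneous additions, or roughly 6.5 billion additions per second under these assumptions.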


With 16,384 PE's running in parallel, 
the system is very fast (see Table). Des- 
pite the bit-serial nature of the processing, 
even the floating-point speeds compare favor- 
ably with the speeds of several fast number- 
crunchers. 


The nearest-neighbor routing network 
moves bit-planes north, east, south or west 
at 100 nanoseconds per step. The edges of 
the array are either left open or connected 
to the opposite edges under program control 
so the array can be given a planar, 
cylindrical or toroidal topology. An 
independent I/O-routing network inputs data 
through a wide port on the west edge and 
outputs data through another wide port on the 
east edge. Input and output are overlapped 
with processing and can transfer data at 
rates up to 160 megabytes/second. 
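The effect of the edge connections on a routing step can be illustrated with a toy bit-plane shift. The functions below are hypothetical and make one assumption the summary does not state: a PE on an open edge receives a zero.

```python
def shift_east(plane, wrap=False):
    """Move every bit one PE to the east.

    With wrap=False the west column is filled with zeros (open edge);
    with wrap=True the east column re-enters on the west edge.
    """
    return [[(row[-1] if wrap else 0)] + row[:-1] for row in plane]


def shift_south(plane, wrap=False):
    """Move every bit one PE to the south, with the same edge options."""
    width = len(plane[0])
    top = plane[-1][:] if wrap else [0] * width
    return [top] + [row[:] for row in plane[:-1]]
```

Leaving both directions open gives the planar topology, wrapping one direction gives a cylinder, and wrapping both gives a torus. As a rough consistency check on the quoted I/O rate: if the port is one 128-bit column wide (an assumption; the summary says only "wide"), then moving one column every 100 nanoseconds corresponds to 16 bytes x 10 million per second = 160 megabytes/second.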


The array control unit (ACU) has three 
components. The PE control unit controls 
the PE's directly. It executes micro-coded 
subroutines to perform the array arithmetic 
and logic operations. The I/O-control unit 
controls array input and output. The main 
control unit performs fast scalar arithmetic 
and calls on the other two components for 
array arithmetic and I/O. Queues allow the 
operations to be overlapped. 


The program and data management unit 
(PDMU) is a standard minicomputer which 
executes the program development software 
(micro-assembler, macro-assembler, linker, 
loader), allows user interaction, runs 
diagnostics and manages the flow of data to 
its peripherals and to a host computer. 
Imagery data can be transferred between the 
host (a DEC VAX 11/780) and the array unit 
at 6 megabytes/second. Images are corner- 
turned between the pixel format of the host 
and the bit-plane format of the array unit 
in a multi-dimensional-access memory. 


SPEED OF TYPICAL OPERATIONS 


OPERATIONS                              EXECUTION SPEED* 


ADDITION OF ARRAYS 
   8-bit integers (9-bit sum) 
   12-bit integers (13-bit sum) 
   32-bit floating-point numbers 

MULTIPLICATION OF ARRAYS (ELEMENT-BY-ELEMENT) 
   8-bit integers (16-bit product) 
   12-bit integers (24-bit product) 
   32-bit floating-point numbers 

MULTIPLICATION OF ARRAY BY SCALAR 
   8-bit integers (16-bit product) 
   12-bit integers (24-bit product) 
   32-bit floating-point numbers 


*Million Operations per Second 


The work reported here was partially funded 
by NASA-Goddard Space Flight Center under 
contract NAS 5-25392. 
CONCURRENT SEARCH AND INSERTION IN AVL TREES (a) 


Carla Schlatter Ellis 
Department of Computer Science 
University of Oregon 
Eugene OR 97403 


Abstract -- This paper addresses the problem 
of concurrent access to dynamically balanced 
binary search trees. Specifically, two solutions 
for concurrent search and insertion in AVL trees 
are developed. The first solution is relatively 
simple and is intended to allow several readers 
to share nodes with a writer process. The second 
solution uses the first as a starting point and 
introduces additional concurrency among writers 
by applying various parallelization techniques. 
Simulation results used to evaluate the parallel 
performance of these algorithms with regard to 
the amount of concurrency achieved and the paral- 
lel overhead incurred are summarized. 


Introduction 


Dynamically balanced binary search trees are 
valuable data structures for implementing symbol 
tables and directories. This paper deals with 
the problem of concurrent access to trees built 
by one of the most widely studied of the 
balancing techniques, namely, AVL trees. It has 
been shown [1] that the AVL tree construction is 
the most efficient method of balancing binary 
search trees when operations are limited to 
insertion and searching. 

It is not difficult to imagine an application 
in which concurrent insertion and retrieval 
of items in a table maintained as an AVL tree 
would be desirable. For example, a compiler 
designed to operate in a parallel processing 
environment might be organized such that several 
processes require access to the symbol table. In 
an earlier paper [2] we considered this problem 
of parallel compilation and found that sharing 
the symbol table among the proposed parallel 
processes presented a major conflict. Therefore, 
investigating the possibility for concurrency in 
the manipulation of these data structures is 
important. 

In this paper we present algorithms for 
concurrent search and insertion in AVL trees. 
This work is related to similar studies with 
B-trees [3, 4, 7] and uses the same basic approach 
of placing locks on nodes of the tree. The 
presentation begins by defining our notation in 
terms of the AVL insertion algorithm for a 
sequential environment. Next, we outline the 
parallelization techniques applied in the design 
of two solutions for concurrent search and 
insertion. Finally, simulation results on the 
performance of these parallel algorithms are 
summarized. 

(a) This work was supported by NSF Grant MCS 
76-09839. 


Definitions 


We assume the reader is familiar with the 
terminology and operations associated with binary 
search trees. An AVL tree is defined to be a 
binary search tree such that for any node n in 
the tree, 

   |height(T_l(n)) - height(T_r(n))| <= 1 

where T_l(n) and T_r(n) denote the left and right 
subtrees of n. 
Detailed algorithms for manipulating AVL trees 
can be found in [6]. However, we will briefly 
describe the insertion algorithm in order to 
establish the terminology and because an 
understanding of the sequential algorithm is 
necessary to understand the concurrent 
algorithms. Each node n consists of four fields: 
leftson and riteson, pointers to the roots of 
T_l(n) and T_r(n) respectively or to NIL if the 
subtree is empty, a data field called key, and 
bf, which indicates whether the height of the 
right son is greater than (bf=+1), equal to 
(bf=0), or less than (bf=-1) the height of the 
left son. The balanced property of AVL trees is 
maintained by two transformations on the tree 
called single and double rotations. The 
situations which trigger each type of rotation 
and the modifications made to the tree are 
illustrated in figure 1. 


Figure 1 Rotations in AVL Tree: a) single 
rotation; b) double rotation. (Labels in the 
figure: father, critical node, newly inserted 
node; the notation h indicates that the height 
of the subtree is h.) 
The algorithm for insertion of a leaf with key k 
in an AVL tree is as follows: 
   1) Search the tree to find the appropriate 
      place of insertion and keep a pointer to 
      the last node on the path of insertion 
      with nonzero bf (the root if no such node 
      exists). This is the critical node, cn. 
      Insert the new leaf. 
   2) Adjust the bf fields of nodes on the 
      insertion path between the cn and the new 
      leaf. For each such node, n, if the path 
      to the place of insertion is to its left, 
      k < key(n), bf(n) is changed to -1; 
      otherwise, bf(n) becomes +1. 
   3) Rotate if necessary: If bf(cn)=0, the 
      tree has become higher in the direction of 
      insertion and bf must be modified 
      appropriately. If bf(cn) indicates that its 
      higher subtree is in the opposite direction 
      from the direction of insertion (e.g. 
      bf(cn)=+1 and k < key(cn)), bf is changed 
      to 0. If bf(cn) and the direction of 
      insertion coincide (e.g. bf(cn)=-1 and k < 
      key(cn)), a rotation occurs according to 
      figure 1. 
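The three steps above can be sketched as a runnable sequential insertion. The field names follow the paper (leftson, riteson, key, bf); the helper names `son` and `set_son` and the exact rotation bookkeeping are this sketch's own choices, following the standard critical-node formulation rather than anything stated in the text.

```python
class Node:
    def __init__(self, key):
        self.key = key
        self.leftson = None
        self.riteson = None
        self.bf = 0                      # +1 right higher, -1 left higher


def son(node, direction):
    return node.leftson if direction < 0 else node.riteson


def set_son(node, direction, child):
    if direction < 0:
        node.leftson = child
    else:
        node.riteson = child


def insert(root, k):
    """Insert key k and return the (possibly new) root of the tree."""
    if root is None:
        return Node(k)
    # Step 1: search, remembering the critical node cn and its father.
    father_cn, cn = None, root
    father, cur = None, root
    while cur is not None:
        if k == cur.key:
            return root                  # key already present
        if cur.bf != 0:
            father_cn, cn = father, cur
        father, cur = cur, (cur.leftson if k < cur.key else cur.riteson)
    leaf = Node(k)
    set_son(father, -1 if k < father.key else +1, leaf)
    # Step 2: adjust bf on the path between cn and the new leaf.
    direction = -1 if k < cn.key else +1
    r = son(cn, direction)
    cur = r
    while cur is not leaf:
        cur.bf = -1 if k < cur.key else +1
        cur = cur.leftson if k < cur.key else cur.riteson
    # Step 3: rotate if necessary.
    if cn.bf == 0:
        cn.bf = direction
        return root
    if cn.bf == -direction:
        cn.bf = 0
        return root
    if r.bf == direction:                # single rotation
        set_son(cn, direction, son(r, -direction))
        set_son(r, -direction, cn)
        cn.bf = r.bf = 0
        top = r
    else:                                # double rotation
        p = son(r, -direction)
        set_son(r, -direction, son(p, direction))
        set_son(p, direction, r)
        set_son(cn, direction, son(p, -direction))
        set_son(p, -direction, cn)
        cn.bf = -direction if p.bf == direction else 0
        r.bf = direction if p.bf == -direction else 0
        p.bf = 0
        top = p
    # Re-attach the rotated subtree below the father of cn.
    if father_cn is None:
        return top
    if father_cn.leftson is cn:
        father_cn.leftson = top
    else:
        father_cn.riteson = top
    return root
```

Note that only the father of cn, cn itself and (for a double rotation) nodes just below cn are restructured, which is why the concurrent algorithms concentrate their locking on that part of the path.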


Concurrency in AVL Trees 


We now consider two solutions that allow a 
number of processes to operate concurrently on an 
AVL tree. Both solutions use various locks on 
the nodes of the tree to selectively exclude oth- 
er processes. 


Locking solution 


In the first solution the goal is to allow 
concurrency between a number of readers and a 
writer doing an insertion. The approach is 
straightforward; namely, during its search phase 
a writer will lock other writers out of those 
nodes which may be involved in a rebalancing 
operation. Readers will be locked out of the 
fewest nodes possible and only during a rotation. 
Thus readers can share nodes with a writer while 
it is searching for the place of insertion and 
the critical node, adjusting bf fields along the 
insertion path, and determining if a rotation is 
necessary. The solution uses three types of 
locks: RHO-LOCKs for readers, ALPHA-LOCKs for 
excluding other writers along the path from the 
father of the critical node to the place of 
insertion, and XI-LOCKs to exclude readers from 
nodes modified during a rotation. 


Figure 2 Compatibility graph for locks. 
(Nodes of the graph: read lock, writer exclusion, 
exclusive lock.) 


Figure 2 shows the compatibility relations these 
locks satisfy. An edge between any two nodes in 
this graph means that two different processes may 
simultaneously hold these locks on the same node 
of the tree. Thus a node may be RHO-LOCKed by 
several readers while it is ALPHA-LOCKed by one 
writer. However, if a writer holds a XI-LOCK on 
a node, no other process can hold any other locks 
on it. A single rotation requires XI-LOCKs on the 
father of the critical node and the critical 
node. A double rotation requires an additional 
XI-LOCK on the son of the cn which lies on the 
insertion path. Figure 3 gives an example of con- 
current single rotation and read operations. 
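The compatibility relation described above can be captured in a small table. This is a sketch, not the paper's implementation: the class `NodeLocks` and its non-blocking `try_lock` are hypothetical, and the sketch assumes that locks held by the same process never conflict with each other (so a writer may add a XI-LOCK to a node it has already ALPHA-LOCKed).

```python
# Pairs of lock types that two different processes may hold on one node.
COMPATIBLE = {
    frozenset(["RHO"]),              # several readers may share a node
    frozenset(["RHO", "ALPHA"]),     # readers may share a node with one writer
}


def compatible(a, b):
    return frozenset([a, b]) in COMPATIBLE


class NodeLocks:
    """Locks currently held on a single tree node."""

    def __init__(self):
        self.held = []               # list of (process, lock type)

    def try_lock(self, process, kind):
        """Grant the lock if it is compatible with every other holder."""
        if all(compatible(kind, k) for p, k in self.held if p != process):
            self.held.append((process, kind))
            return True
        return False                 # a real process would wait here

    def release(self, process, kind):
        self.held.remove((process, kind))
```

For example, two readers and one writer can hold RHO, RHO and ALPHA on the same node, but the writer's XI-LOCK request is refused until both readers have released, and no reader is admitted while the XI-LOCK is held.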


Figure 3 Concurrent single rotation 
and reading. (Panels a, b and c show a writer 
inserting k=32 and performing a single rotation 
while two readers, searching for k=26 and k=14, 
traverse the same part of the tree.) 
The algorithms for the reader and writer are 
given below: 


READER 


RHO-LOCK pointer to root; 
current <- pointer to root; 
son <- root; 
while son ~= NIL and k ~= key[son] do 
   begin 
   RHO-LOCK son; 
   release RHO-LOCK on current; 
   current <- son; 
   /* determine appropriate son */ 
   if k < key[current] then 
      son <- leftson[current] 
   else son <- riteson[current] 
   end; 
release RHO-LOCK on current; 
if son = NIL then fail 
else succeed 
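The order in which the READER acquires and releases its locks can be made visible with a trace. The sketch below does not implement real locks; it only records the lock and release events of the listing above for a single reader, using a hypothetical `reader_search` function and a minimal node class.

```python
class Node:
    def __init__(self, key, leftson=None, riteson=None):
        self.key = key
        self.leftson = leftson
        self.riteson = riteson


def reader_search(root, k, trace):
    """Follow the READER listing; append (event, node label) to trace.

    Returns True if k is found. The son is always locked before the
    lock on the current node is released.
    """
    label = "root pointer"
    trace.append(("RHO-LOCK", label))
    son = root
    while son is not None and son.key != k:
        trace.append(("RHO-LOCK", son.key))      # lock the next node first
        trace.append(("release", label))         # then let go of the previous one
        current, label = son, son.key
        son = current.leftson if k < current.key else current.riteson
    trace.append(("release", label))
    return son is not None
```

The trace shows that a reader holds at most two RHO-LOCKs at any time, which is what lets a rotating writer exclude readers from only a few nodes.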


WRITER 


ALPHA-LOCK pointer to root; 
current <- pointer to root; 
father of cn <- current; 
son <- root; 
cn <- root; 
/* search - resulting in path from father of 
   cn to place of insertion remaining locked */ 
while son ~= NIL and k ~= key[son] do 
   begin 
   ALPHA-LOCK son; 
   if bf[son] ~= 0 then begin 
      /* change cn pointer */ 
      father of cn <- current; 
      cn <- son; 
      release ALPHA-LOCKs on ancestors 
         of current 
      end; 
   current <- son; 
   determine appropriate son 
   end; 
if son = NIL then insert new node with 
      key = k as appropriate son of current 
else release all ALPHA-LOCKs held by this 
      process and terminate; 
/* adjust balance fields between cn and 
   new node as in sequential insertion */ 
if k < key[cn] then begin 
      direction <- -1; 
      r <- current <- leftson[cn] 
      end 
else begin 
      direction <- +1; 
      r <- current <- riteson[cn] 
      end; 
retrace path from current to new node 
   changing bf appropriately; 
/* rotate if necessary */ 
case on bf[cn] 
   0: bf[cn] <- direction; 
   -direction: bf[cn] <- 0; 
   direction: if bf[r] = direction then 
         begin 
         XI-LOCK father of cn; 
         XI-LOCK cn; 
         do single rotation; 
         release all XI-LOCKs held by 
            this process 
         end 
      else begin 
         XI-LOCK father of cn; 
         XI-LOCK cn; 
         XI-LOCK r; 
         do double rotation; 
         release all XI-LOCKs held by 
            this process 
         end 
esac 
release all ALPHA-LOCKs held by this process 


In this algorithm, a writer ALPHA-LOCKs its 
path during the search phase so that the nodes 
along this path from the father of the cn to the 
place of insertion remain locked for the inser- 
tion, rebalancing, and rotation operations. This 
has the effect of locking subsequent writers out 
of the entire subtree rooted at the father of the 
cn rather than just those nodes on the first 
writer’s path. 


Claiming solution 


The second solution uses the first as a 
starting point and introduces additional 
concurrency through the use of various 
parallelizing techniques. The goal in this next 
algorithm is to increase concurrency among 
writers by allowing writers whose restructuring 
paths (i.e., the path between the father of cn 
and the newly inserted node) are disjoint to 
operate concurrently. The solution discriminates 
against writers that share the same path. 
Hopefully, the keys to be inserted by the 
parallel writer processes will tend to be spread 
throughout the tree rather than clustered. 

The strategy used is summarized below: The 
writer process will search for the place of 
insertion using RHO-LOCKs and will place an 
exclusive claim on the probable father of cn (the 
cn pointer is set to the last node on the 
insertion path with nonzero bf whose father is 
not already claimed by another writer. As we 
shall see, this may not be the true critical node 
if another writer shares the insertion path). 
This node is claimed rather than locked so that 
another writer may read past it and place a claim 
within this subtree if another potential cn is 
found. The new node with key k will be inserted 
while other inserting writers are excluded from 
the place of insertion. In its rebalancing phase, 
the writer excludes other rebalancing and 
rotating writers from nodes on the path from the 
father of cn to the first new node encountered on 
the insertion path (note that this new node was 
not necessarily inserted by this writer) and 
balance fields between cn and the new node will 
be adjusted. It is during the restructuring phase 
that writers which share the same path are 
penalized: one writer will claim the father of 
the lowest node with nonzero bf. Other writers 
will claim nodes higher in the tree and may find 
a lower cn during their rebalancing operations. 
Then the cn pointer must be moved down and the bf 
fields readjusted. Thus writers along a shared 
path may do useless work that will need to be 
undone. Finally, rotation will take place if 
necessary with XI-LOCKs protecting the nodes 
involved. 

This solution requires a modification in the 
data structure of the previous algorithm. In 
addition to the key, leftson, riteson, and bf 
fields, each node will contain one field, the 
guardian field, which will indicate which process 
is responsible for the rebalancing and rotation 
operations associated with the insertion of this 
node. If those operations have already been taken 
care of, this field indicates that this is an 
"old" node. In practice, this could be 
implemented with a single bit guardian field to 
signify "old" or "new" and an associative table 
pairing new nodes with their guardian processes. 
Or, since only three codes are used in the two 
bit bf field and all new nodes have bf=0, the 
remaining code could be utilized to indicate a 
new node, thus eliminating the guardian field 
altogether. 

This algorithm also uses a slightly different 
locking scheme. We will still need RHO-LOCKs for 
reading and XI-LOCKs for exclusion of other 
processes during rotation. IOTA-LOCKs will be 
used to enforce mutual exclusion among writers 
trying to insert a new node at the same place. 
The most significant change lies in replacing the 
ALPHA-LOCK of the previous algorithm with an 
ALPHA'-LOCK and a mark bit that explicitly 
implement a lock to enforce mutual exclusion 
among rebalancing and rotating writers. The 
special feature of this implemented lock is that 
in addition to the operations of requesting the 
lock (which implies the process is to wait if the 
request cannot be granted) and releasing the 
lock, a process will be able to test the mark bit 
to determine whether or not a node is locked 
without waiting for a lock request to be granted. 
The new compatibility relation is shown in 
figure 4. 
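The idea of a lock whose state can be tested without waiting can be sketched with Python's standard threading module. This is an analogy rather than the paper's mechanism: a `threading.Condition` stands in for the ALPHA'-LOCK and replaces the polling loop of a marking procedure with a blocking wait, and the class and method names are hypothetical.

```python
import threading


class MarkableNode:
    """A node with a mark bit protected by a short-lived lock."""

    def __init__(self):
        self._cond = threading.Condition()   # plays the role of the ALPHA'-LOCK
        self.marked_by = None                # None means "unmarked"

    def is_marked(self):
        """Test the mark bit without waiting for any lock request."""
        return self.marked_by is not None

    def try_to_mark(self, process):
        """Mark the node if it is free; never wait. Returns success."""
        with self._cond:
            if self.marked_by is None:
                self.marked_by = process
                return True
            return False

    def wait_to_mark(self, process):
        """Wait until the node is unmarked, then mark it."""
        with self._cond:
            while self.marked_by is not None:
                self._cond.wait()
            self.marked_by = process

    def unmark(self, process):
        """Clear the mark if this process holds it and wake any waiters."""
        with self._cond:
            if self.marked_by == process:
                self.marked_by = None
                self._cond.notify_all()
```

A searching writer would use `is_marked` or `try_to_mark` and carry on if the node is taken, while a rebalancing writer would use `wait_to_mark`; this difference is what delays the blocking of one process by another.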
Figure 4 Compatibility graph for locks 
in Claiming solution. 


There are a few key ideas that promote 
concurrency in this solution. The first important 
technique is to allow a temporary degradation of 
the tree structure. Since a writer can search 
and insert with a nominal amount of interference 
from other processes, it is possible that the 
tree could become quite unbalanced (i.e., no 
longer satisfying the AVL definition) after a 
number of processes have inserted but not yet 
restructured (cf. figure 5). The second technique 
is a relaxation of a process's responsibility to 
do its own work. Let n_1 and n_2 be two newly 
inserted nodes such that n_1 is an ancestor of 
n_2. The restructuring operations associated with 
n_1 should be done before those associated with 
n_2, but it is possible that the writer which 
inserted n_2, process 2, performs its rebalancing 
phase before process 1. In this solution, the two 
processes will essentially trade responsibilities 
with process 2 rebalancing for n_1 and process 1 
rebalancing for n_2. Because of the top-down 
nature of the restructuring pass, this trading 
must be explicitly done. A message will be sent 
to process 1 telling it to search for n_2's key 
during its restructuring pass. The final 
technique is made possible by the new locking 
scheme. The new lock has a different effect on 
searching writers than on rebalancing writers, 
thus essentially delaying the blocking of one 
process by another. 


Figure 5 Modified AVL Tree. (Labels in the 
figure: "old" nodes, newly inserted nodes.) 


Readers in this solution are identical to 
readers in the Locking solution. The writer algo- 
rithm is given below. We first present the pro- 
cedures called by the main program followed by 
the main program itself. 


procedure TRY TO CLAIM; 
   begin 
   ALPHA'-LOCK current; 
   if bf[son] ~= 0 and current is unmarked 
   then begin 
      mark current; 
      cn <- son; 
      if claim ~= pointer to root then 
         unmark claim; 
      claim <- current 
      end; 
   release ALPHA'-LOCK on current 
   end 


procedure TRY TO INSERT; 
   begin 
   IOTA-LOCK current; 
   determine appropriate son of current 
      again; 
   if son = NIL then 
      insert newnode with key = k; 
   release IOTA-LOCK on current 
   end 


procedure WAIT TO MARK(node); 
   /* essentially ALPHA-LOCKing */ 
   begin 
   ALPHA'-LOCK node; 
   while node is marked do 
      begin 
      release ALPHA'-LOCK on node; 
      while node is marked do 
         ALPHA'-LOCK node 
      end; 
   mark node; 
   release ALPHA'-LOCK on node 
   end 


WRITER 


RHO-LOCK pointer to root; 
claim <- current <- pointer to root; 
son <- cn <- root; 
/* search and claim potential father of cn */ 
while son ~= NIL and k ~= key[son] do 
   begin 
   RHO-LOCK son; 
   if bf[son] ~= 0 and current is unmarked and 
         current ~= claim then TRY TO CLAIM; 
   release RHO-LOCK on current; 
   current <- son; 
   determine appropriate son; 
   if son = NIL then TRY TO INSERT 
   end; 
if son ~= NIL then begin 
   /* k = key[son] so no insertion necessary */ 
   if claim ~= pointer to root 
      then unmark claim; 
   release RHO-LOCK on current; 
   terminate 
   end; 
release RHO-LOCK on current; 
/* rebalance - mark path from father of cn 
   to place of insertion 
   and adjust balance factor fields */ 
if claim = pointer to root 
   then WAIT TO MARK(claim); 
WAIT TO MARK(cn); 
current <- cn; 
while guardian[current] is "old" do 
   begin 
   if k < key[current] then 
      begin 
      r <- son <- leftson[current]; 
      direction <- -1 
      end 
   else begin 
      r <- son <- riteson[current]; 
      direction <- +1 
      end; 
   WAIT TO MARK(son); 
   while bf[son] = 0 and guardian[son] = "old" 
   do begin 
      current <- son; 
      if k < key[current] then 
         begin 
         bf[current] <- -1; 
         son <- leftson[current] 
         end 
      else begin 
         bf[current] <- +1; 
         son <- riteson[current] 
         end; 
      WAIT TO MARK(son) 
      end; 
   if bf[son] ~= 0 then 
      /* another process has been rebalancing 
         on the path already */ 
      begin 
      cn <- son; 
      claim <- current; 
      /* unmark from old claim to node above 
         new claim and restore bf fields to zero 
         from son of old cn to new claim */ 
      current <- old cn; 
      unmark old claim; 
      while current ~= claim do 
         begin 
         determine appropriate son; 
         bf[son] <- 0; 
         unmark current; 
         current <- son 
         end; 
      current <- cn 
      end 
   else begin 
      /* son hasn't been rebalanced for yet */ 
      if son is not this process's new node 
      then begin 
         /* trade new nodes with process 
            now responsible for son node */ 
         send guardian[son] message to reset 
            its k to key[newnode] 
            and its newnode pointer to newnode; 
         guardian[newnode] <- guardian[son] 
         end; 
      current <- son 
      end 
   end; 
/* rotate if necessary */ 
case on bf[cn] 
   0: bf[cn] <- direction; 
   -direction: bf[cn] <- 0; 
   direction: if bf[r] = direction then begin 
         XI-LOCK claim; 
         XI-LOCK cn; 
         XI-LOCK r; 
         do single rotation; 
         release all XI-LOCKs held by 
            this process 
         end 
      else begin 
         XI-LOCK claim; 
         XI-LOCK cn; 
         XI-LOCK r; 
         XI-LOCK appropriate son of r; 
         do double rotation; 
         release all XI-LOCKs held by 
            this process 
         end 
esac 
guardian[current] <- "old"; 
unmark all nodes marked by this process 


Informal correctness proofs for these 
solutions can be found in [5]. 


Evaluation 


The primary goal in the design of these 
parallel algorithms was to increase the amount of 
concurrency possible between readers and writers 
and among numerous writers themselves. In the 
attempt to increase parallelism, a certain amount 
of parallel overhead was incurred (e.g. locking 
and unlocking of nodes, extra fields per node). 
Therefore concurrency and parallel overhead will 
be the two most important factors to be 
considered in the evaluation of our algorithms. 
Results of simulation experiments will be 
summarized here. For a detailed discussion of the 
simulation study and a more complete presentation 
of the results see [5]. Very briefly, the 
approach is to simulate a fixed number of reader 
and writer processes executing steps of these 
algorithms scheduled according to randomly 
generated execution times which reflect 
fluctuations due to factors such as memory 
interference and differences between physical 
processors. 

Among the measures of interest are the average
number of concurrently busy processes during an
interval of time (where busy means that the process is
not waiting to be able to lock a node and is not
finished with its operation), the improvement ratio
(i.e., time for sequential steps / elapsed time for
parallel execution), average path length and longest
path (indications of how the tree degrades), and
measures of the degree of locking in the tree and the
amount of work done in placing and releasing locks.

Figure 6. Concurrency among readers and writers.

Figure 6 deals with the amount of concurrency
achieved by each solution as the tree grows. The
results displayed are based on experiments with 16
readers and 16 writers active in the system. As could
be expected, the first solution allows much less
concurrency among writers than does the second
algorithm. Also not surprisingly, there is a
considerable amount of parallel overhead involved in
executing these algorithms. One approach to evaluating
this overhead uses the processor utilization and the
improvement ratio to yield a value x which indicates
that the cumulative time to execute all steps of each
busy process is about x times the execution time spent
doing work that would correspond to steps of a
sequential execution. This value is 2.7 for our first
solution and 2.5 for the second.


The degradation in the tree structure for the
second algorithm is not significant for a randomly
chosen set of keys.

The average number of locks placed and released
per insertion is used to estimate the overhead cost of
locking in the following way:

L = (cost of placing ALPHA-LOCK x average number of
ALPHA-LOCKs placed) + (average number of XI-LOCKs
placed) + (average number of RHO-LOCKs placed).
Let I be the average insertion path. Then we have
for the first solution, L = (3 x I) + 1.1 + 0, and for
the second solution, L = (3 x 6.4) + 1.5 + Ls


Finally, the maximum number of locks which a
writer would be expected to hold at some time during
its insertion is a measure of potential concurrency.
Since there is very little difference between these two
solutions with respect to this measure, the concurrency
among writers executing the second algorithm can be
explained by the delay in locking rather than fewer
locks held.

With regard to storage overhead, we compare the
requirements of these concurrent solutions with the
data structure of the sequential solution. One notable
difference is the space which must be devoted to the
various locks. In addition, the second algorithm calls
for an associative table pairing active writer
processes with their newly inserted nodes.
Conclusion

In this paper we have presented algorithms for
concurrently searching and inserting in AVL trees. The
solutions illustrate parallelizing techniques such as
relaxing a process's responsibility to do its own work,
allowing limited degradation of the structure, and
delaying locking. These techniques should prove to be
useful for introducing concurrency in other problems.
The measurements presented indicate a reasonable
increase in the amount of concurrency achieved by
applying these techniques. In spite of the overhead,
parallel execution yields a speed-up.

A TREE MACHINE FOR SEARCHING PROBLEMS


Jon Louis Bentley 
H. T. Kung 
Department of Computer Science 
Carnegie-Mellon University
Pittsburgh, Pennsylvania 15213 


Abstract -- In this paper we describe a new
tree-structured machine (suitable for VLSI
implementation) that solves a large class of searching
problems. A set of N elements can be maintained on an
N-processor version of this machine such that
insertions, deletions, queries and updates can all be
processed in 2 lg N time units. The queries can be very
complex, including problems arising in ordered set
manipulation, data bases, and statistics. The machine
is pipelined so that M successive operations can be
performed in M-1 + 2 lg N time units. In this paper we
will study both the basic machine structure and the
actual implementation of the machine.


1. Introduction 


Very Large Scale Integrated circuitry (VLSI) has
been increasing in speed and decreasing in size at an
amazing rate over the past several years, and it
promises to continue at this rate far into the next
decade. In this paper we will describe a
tree-structured machine for solving searching problems
that is ideally suited for implementation in VLSI. The
searching problems that it solves arise in a number of
application areas (including ordered set manipulation,
data bases and statistics), and the machine is able to
solve all of the problems very efficiently.


Before we describe this machine in detail, it is 
important to characterize its contribution in general 
terms. The authors believe that there is a spectrum of 


impacts that advances in VLSI technology will have on 
computer architecture. At one extreme, this technology 
will allow conventional architectures to be implemented 
as smaller and faster machines -- this will lead to more 
sophisticated interconnections of conventional 
machines (see Swan, Fuller and Siewiorek [1977],
Browning [1979] and Sequin, Despain and Patterson
[1978]). Also at this end of the spectrum will be minor 
architectural changes that exploit certain features of 
VLSI; this area has been explored by Sites [1979]. At 
the other extreme, VLSI architectures have been 
proposed that are radical departures from the von 
Neumann tradition (see Backus [1978], Mago [1979] 
and Wilner [1978]). In this paper we will investigate
an approach that lies between these two extremes: a 
high-performance, special-purpose, non-von Neumann 
computing device that is designed to be used in 
conjunction with a typical computer. In general, such
devices should be constructed only when they solve a
problem satisfying two criteria: the problem should
currently consume large quantities of computer time,
and the proposed special-purpose device must be much
more efficient than conventional ways of solving the
particular problem. When such a problem is identified it
is reasonable to augment a general-purpose computing 
system with a special-purpose device for solving the 
problem; the structure of such a system is depicted in 
Figure 1. Many such special-purpose devices have 
recently been proposed; see, for example, Kung and 
Leiserson [1978]. 


Figure 1. General system structure. 


In this paper we will investigate a special-purpose
machine for solving searching problems. This machine is
described at an abstract level in Section 2, where we
will also review some necessary background in
searching problems. An architecture (that is, a user's
view) of a potential machine is described in Section 3,
and issues of implementing that architecture in VLSI are
discussed in Section 4. Conclusions are then offered in
Section 5.

2. The Abstract Machine


In this section we will investigate the
tree-structured searching machine at an abstract level,
apart from the details of architecture or implementation.
The general searching problem it solves calls for
maintaining a file of fixed-format records. We must be
able to perform the operations of inserting a new
record into the file, deleting an existing record from the
file, updating records in the file, and querying the file
to answer questions. Before we examine the general
searching problem, we will investigate one searching
problem in particular.


That particular problem is called member searching.
In its abstract form, it calls for maintaining a set of
elements so we can determine if a new element is a
member of the set. In concrete applications, other
information is usually also required. For example, if we
find that a particular social security number is a member
of a set of social security numbers, then we usually
wish to retrieve other information (such as
Year-to-Date taxes). We will now investigate how the
tree machine solves the abstract member searching 
problem, and then return in the next section to the 
complicating issues that arise in applications. 

Figure 2. Structure of the tree machine (input node at the top, output node at the bottom).


The basic organization of the tree-structured 
searching machine is depicted in Figure 2. There are 
three kinds of nodes in the machine: circles (which 
broadcast data), squares (which have limited storage 
and computation power) and triangles (which "combine" 
answers to queries). A set of N elements is stored in 
this machine by placing each element of the set into a 
distinct square node of the tree. Consider now the
problem of performing the member search to answer the
query "Is 17 an element of the set?". We accomplish
this by inserting 17 into the input node and
broadcasting it down the tree -- lg N steps later the
value 17 will arrive at all of the squares. This situation
is illustrated in Figure 3a. At that point we compare the
values stored in each square to 17 and set a bit to one
if the value is equal to 17 and zero otherwise; this is
shown in Figure 3b. We can now combine the bits
together through the bottom portion of the network by
letting each triangle compute the logical or of its two
inputs, as illustrated in Figure 3c. So after a total of
2 lg N time units have passed since the query was
posed, a single bit emerges from the output node telling
whether or not 17 is an element of the set. We have
thus described a procedure for determining whether a
given object is a member of the set whose elements
are stored in the square nodes.
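
As a minimal sketch of the member search just described, the following
Python fragment simulates the compare step in the squares and the
level-by-level "or" in the triangles. The function name member_search and
the list representation of the squares are assumptions made for
illustration only; the list length is assumed to be a power of two.

```python
def member_search(squares, query):
    """Simulate one member search; returns (answer, time_units).

    squares is a list holding the value stored in each square node
    (assumed length: a power of two).
    """
    depth = len(squares).bit_length() - 1   # lg N levels on each side
    # Broadcast phase: after lg N steps every square has seen the query
    # and sets its bit to one if its stored value equals the query.
    level = [1 if value == query else 0 for value in squares]
    # Combine phase: each triangle ORs its two inputs, one level per step.
    for _ in range(depth):
        level = [level[i] | level[i + 1] for i in range(0, len(level), 2)]
    return bool(level[0]), 2 * depth


print(member_search([3, 17, 8, 42], 17))
```

The returned time is simply the 2 lg N figure from the text, not a
measurement of the simulation itself.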


It is important to note that the tree machine has a 
very regular data flow: the data moves in discrete 
steps in only one direction (from the input node to the 
output node). Thus if many successive elements are 
going to be tested for membership in the set stored in 
the square nodes, then the process of answering those 
queries can be pipelined. As the value of the first 
element to be tested is going down the tree, the next 
value can follow one step behind, and so on. If M
successive tests are performed in this manner, exactly
M-1 + 2 lg N time units pass between the entry of the
first query at the top of the tree and the exit of the
last of the answers at the bottom of the tree.
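
To make the effect of pipelining concrete, the sketch below evaluates the
M-1 + 2 lg N expression for a given M and N. The comparison figure for
queries issued strictly one after another (M times 2 lg N) is our own
illustration, not a number taken from the text, and both function names
are hypothetical.

```python
def pipelined_time(m, n):
    """Time units for m pipelined queries on n squares (n a power of two)."""
    lg_n = n.bit_length() - 1
    return (m - 1) + 2 * lg_n


def unpipelined_time(m, n):
    """Illustrative baseline: each query waits for the previous answer."""
    lg_n = n.bit_length() - 1
    return m * 2 * lg_n


print(pipelined_time(1000, 1024), unpipelined_time(1000, 1024))
```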


The tree machine is able to solve many problems 
besides member searching. For example, if a multiset 
of elements (that is, a set in which one element can 
appear many times) were stored in the square nodes of 
the tree, we might wish to count how many times a 
given object appears in the set. We proceed exactly 
as we did for member searching, first broadcasting the 
given element through the circles to the square nodes. 
We load a one into each square if it is equal to the 
given object and zero otherwise, and then combine the 
answers by letting the triangles sum the values of their 
inputs. Another example is given by nearest neighbor
searching. If we wished to find the element of the set
that is closest to 17, then we do the following:
broadcast 17 through the input node to all squares,
subtract the value stored in the square from 17 and
take the absolute value of the difference, and finally
take the minimum of all those values by having the
triangles return the minimum of their two inputs. Note
that both for member counting and nearest neighbor
searching, we can answer a single query in 2 lg N time
and a series of M queries in M-1 + 2 lg N time.
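
The two queries above differ from member searching only in what each
square loads and in how the triangles combine their inputs. A small sketch
under the same assumptions as before (a power-of-two list of stored
values; all names are hypothetical):

```python
def combine(values, op):
    """Reduce a list pairwise, one triangle level per pass."""
    while len(values) > 1:
        values = [op(values[i], values[i + 1])
                  for i in range(0, len(values), 2)]
    return values[0]


def count_occurrences(squares, x):
    # Each square loads 1 if it equals x, else 0; triangles sum.
    return combine([1 if v == x else 0 for v in squares], lambda a, b: a + b)


def nearest_distance(squares, x):
    # Each square loads |x - v|; triangles take the minimum.
    return combine([abs(x - v) for v in squares], min)


print(count_occurrences([17, 3, 17, 9], 17), nearest_distance([20, 3, 11, 9], 17))
```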


Figure 3. A member search: a.) 17 is broadcast; c.) answer is returned.

In general, the tree machine can solve any problem
that can be phrased as computing some function over
every element in the set (such as equality or absolute
value of difference) and then combining the values of
those functions by some associative, commutative
binary operator. For example, the rank of an element X
in a set (that is, the number of elements in the set less
than X) can be calculated by storing in each square a
one if the element is less than X and zero otherwise;
the final answer is then computed by having the
triangles add their inputs. Other problems defined on
totally ordered sets that can be solved by the tree
machine include predecessor (what is the greatest
element less than the given?), successor (what is the
least element greater than the given?), and minimum
(what is the least element in the set?). In general, the
tree machine can solve all of the "Decomposable
Searching Problems" defined by Bentley and Saxe
[1979]. That reference contains both an algebraic
definition of the class and a list of many particular
problems.
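
The general pattern (a function applied in every square, followed by an
associative, commutative combination in the triangles) can be sketched as
a single helper. The helper name decomposable_query and the use of None
for "no candidate in this square" are assumptions of this sketch, not part
of the machine description.

```python
def decomposable_query(squares, leaf_fn, op):
    """Apply leaf_fn in every square, then combine pairwise with op."""
    level = [leaf_fn(v) for v in squares]
    while len(level) > 1:
        level = [op(level[i], level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]


def _skip_none(better):
    # Wrap min/max so that a square with no candidate (None) is ignored.
    def op(a, b):
        if a is None:
            return b
        if b is None:
            return a
        return better(a, b)
    return op


def rank(squares, x):
    return decomposable_query(squares, lambda v: 1 if v < x else 0,
                              lambda a, b: a + b)


def predecessor(squares, x):
    return decomposable_query(squares, lambda v: v if v < x else None,
                              _skip_none(max))


def successor(squares, x):
    return decomposable_query(squares, lambda v: v if v > x else None,
                              _skip_none(min))


print(rank([5, 1, 9, 3], 4), predecessor([5, 1, 9, 3], 4), successor([5, 1, 9, 3], 4))
```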


The tree machine is also able to answer much more 
complicated kinds of queries (of the form that arise in 
data base applications, for instance). Suppose, for 
example, that every square node of the tree contains a 
record with ten keys. We might want to know how
many records there are in the file with first key equal
to a given value, second key at least as great as the
third key, the fourth key in a certain range, and so on.
This type of query is easily answered: we merely 
broadcast each of the conditions down to the square 
nodes, keeping track in each node of whether it has 
satisfied all the conditions shipped so far. We load a 
one if all conditions have been satisfied and a zero 
otherwise, and combine by having the triangles sum 
their inputs. Many applications call for a list of the 
satisfying records instead of merely their count, and 
this can be accomplished by letting the triangles 
compute the union of their inputs. This can be viewed 
intuitively by observing each triangle independently, 
and imagining a person "tapping" the entire machine at 
each time step. As each triangle is tapped, there are 
three cases to consider: if it has no items in its inputs, 
it reports that; if it has one item, it returns it; and if it 
has two items, it returns only one (delaying the other 
until the next tap). This "tapping" process continues 
as long as there are elements that have yet to be 
reported (note that to compute unions in this manner, 
the pipelining must be "turned off"). 
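
The "tapping" process can be sketched as follows. We assume, purely for
illustration, that every node holds at most one item at a time, that a
triangle with two waiting inputs takes the left one first, and that an
unused or non-matching square is represented by None; none of these
details is fixed by the description above.

```python
def union_by_tapping(leaves):
    """Report every non-None leaf item, one tree level per tap.

    leaves: list (power-of-two length) holding a record or None.
    """
    levels = [list(leaves)]
    while len(levels[-1]) > 1:
        levels.append([None] * (len(levels[-1]) // 2))
    reported = []
    while any(slot is not None for lvl in levels for slot in lvl):
        # One tap: the output node reports the item it holds, if any.
        top = levels[-1]
        if top[0] is not None:
            reported.append(top[0])
            top[0] = None
        # Then every empty node takes one item from one of its inputs,
        # leaving (delaying) the other input until a later tap.
        for k in range(len(levels) - 1, 0, -1):
            for j in range(len(levels[k])):
                if levels[k][j] is not None:
                    continue
                for c in (2 * j, 2 * j + 1):
                    if levels[k - 1][c] is not None:
                        levels[k][j] = levels[k - 1][c]
                        levels[k - 1][c] = None
                        break
    return reported


print(union_by_tapping(["a", None, "b", "c"]))
```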


Having discussed searching at some length, we will 
now turn to the issues of maintaining the set of 
elements stored in the square nodes. A tree machine 
with N square nodes (where N is a power of two) can 
store up to N records. A new record can be inserted 
into the set by placing it in an unused square. We find 
such a square by having each circle keep track of the 
number of unused descendants in each of his two sons. 
When a request comes to the root for a new (unused) 
position, he passes the request to one of his non-empty 
sons, and so on. Mechanically, this is accomplished by 
turning off all of the processors except the one finally
chosen as the holder of the new record; this processor
is then loaded with the desired data. Note that a single
record can be inserted in lg N steps, and a set of M
records can be inserted in M-1 + lg N steps.

Another maintenance operation is that of updating a
set of records: this can be easily accomplished by
broadcasting the conditions that the changed records
must meet, turning off all processors that do not meet
the conditions, and then making the desired changes.
(Although the update set will often have just one
element, an example of a "mass update" might be
processed on the first of the month: for all salesmen
with Month-Of-Starting-Employment equal to
This-Month, add one to Years-Of-Service.) To delete a
single record we set a flag in its square node saying
that it is unused and then adjust the counts in all of the
circles above it. This can be accomplished either by
pushing information "backward" to the top of the tree
(adding one to each counter as you go), or by doing a
dummy reinsertion of that element, and modifying the
counters on the way down. The time for either of these
operations is proportional to lg N. Notice that if a set of
elements is to be deleted, this can be accomplished in
parallel and all counters can be reset (by pushing the
information up the tree) in lg N steps. Although having
information go up the tree is handy for deletion, it does
complicate the basic design severely; this feature
might therefore not be implemented.
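
The counter bookkeeping for insertion and deletion can be sketched with an
array laid out like a binary heap. The class name TreeMachineStore, the
heap layout, and the choice of the left son when both sons have room are
assumptions of this sketch; the deletion shown is the "backward" variant
that adds one to each counter on the way up.

```python
class TreeMachineStore:
    """Toy model of the circles' unused-descendant counters."""

    def __init__(self, n):
        self.n = n                      # number of squares, a power of two
        self.records = [None] * n
        # free[i] = unused squares below node i (root is 1); the two
        # counters kept by circle i are free[2*i] and free[2*i + 1].
        self.free = [0] * (2 * n)
        for i in range(n, 2 * n):
            self.free[i] = 1
        for i in range(n - 1, 0, -1):
            self.free[i] = self.free[2 * i] + self.free[2 * i + 1]

    def insert(self, record):
        if self.free[1] == 0:
            raise OverflowError("no unused square")
        node = 1
        while node < self.n:            # walk down through the circles
            self.free[node] -= 1
            left = 2 * node
            node = left if self.free[left] > 0 else left + 1
        self.free[node] = 0
        self.records[node - self.n] = record
        return node - self.n

    def delete(self, index):
        self.records[index] = None      # flag the square as unused
        node = index + self.n
        while node >= 1:                # push the change up to the root
            self.free[node] += 1
            node //= 2


store = TreeMachineStore(4)
print([store.insert(x) for x in "abcd"])
```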


So far in our discussion each machine has
represented but one set. In some applications,
however, a given user might wish to represent many
sets, or many users might want to use the machine
independently. Either of these can be accomplished so
long as the sum of the sizes of the sets is less than N,
the number of square nodes. Although we might be
tempted to "slice" the machine into sections to
accomplish this, there is a much more elegant solution.
Namely, a fixed portion of each record is dedicated to a
"set identification field", or "SetID". To process an
operation on Set 56 (or a set belonging to user 56), we
have as a prelude to the operation the sequence
"check SetID for equality with 56 and turn off the
processor if not equal". (Notice that we are not
requiring that all records in all sets be of the same
format, but just that they have one field in common.) In
an environment with much sharing, this prelude will
occur so often that it might be advantageous to provide
a single instruction that accomplishes its purpose.
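
A minimal sketch of the SetID prelude follows. Each used square is
modelled as a (set_id, record) pair and an unused square as None; this
representation and the function name are assumptions for illustration.

```python
def count_in_set(squares, set_id, predicate):
    """Count records of one set that satisfy predicate."""
    count = 0
    for square in squares:
        active = square is not None
        # Prelude: check SetID for equality and turn off if not equal.
        if active and square[0] != set_id:
            active = False
        if active and predicate(square[1]):
            count += 1
    return count


squares = [(56, 10), (7, 10), (56, 3), None]
print(count_in_set(squares, 56, lambda r: r == 10))
```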


Although so far we have used the tree machine to
solve only searching problems, it can be applied to
many other problems as well. For instance, it can be
used to sort a set of M elements in time proportional to
M (as long as M is less than N, the number of square
nodes in the tree machine). This is accomplished by
making two passes through the M elements: the first
inserts the elements into the machine, and the second
counts for each element the number of elements less
than it (that is, it computes the element's rank, as we 
saw before). This tells precisely where each element 
occurs in sorted order (the output is a permutation 
vector), and it is then trivial to arrange the elements 
into sorted order. By use of pipelining, both steps run in 
time linear in M. Note that it was critical to phrase 
sorting as a counting problem, rather than as extracting 
the minimum, to make use of pipelining in the second 
step -- this algorithm essentially implements an N^2
algorithm in N time by using all N processors in parallel.
There are many other examples of such speedups for 
problems that are not prima facie searching problems. 
Two such examples are computing all nearest-neighbor 
pairs in a k-dimensional point set (which arises in data 
analysis) and reporting all pairwise design rule 
violations in a VLSI mask (a design automation task). 
The application of this machine to the problem of
constructing minimum spanning trees has been
discussed by Bentley [1979].
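
The two-pass sort described above can be sketched sequentially as
follows. The sketch assumes distinct elements (so that ranks do not
collide) and computes each rank with a plain loop standing in for one
pipelined rank query; the function name is hypothetical.

```python
def sort_by_rank(elements):
    """Return (permutation vector, sorted list) for distinct elements."""
    stored = list(elements)                       # pass 1: insert
    ranks = [sum(1 for s in stored if s < x)      # pass 2: one rank query
             for x in elements]                   #         per element
    output = [None] * len(elements)
    for x, r in zip(elements, ranks):
        output[r] = x                             # place by rank
    return ranks, output


print(sort_by_rank([30, 10, 20]))
```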


This concludes our discussion of the machine at an
abstract level, and we can now state the properties
that a concrete embodiment of the machine must
possess. There must be three kinds of nodes in such a
machine: circles, squares and triangles. The circles
must broadcast data and have a small amount of state
(namely, remember how many unused squares are
descendants of each of their sons). The only
processing required of a circle is incrementing or
decrementing by one. The squares, however, must
have substantial memory and computation power. Each
square must have enough memory to store the largest
record in any foreseeable application, and enough
processing capability to handle the most difficult kinds
of queries and updates desired. The triangles must be
able to combine answers. Most of the "combinators"
we desire are very simple to implement; these are and,
or, min, max, and plus. The only complicated combinator
is union, and we are willing to "turn off" pipelining in
the presence of that operator.


3. An Architecture 


In Section 2 we investigated the tree-structured 
searching machine at an abstract level, ignoring many 
issues of implementation. In this section we will move 
one step closer to an implementation, and describe a 
particular architecture (that is, a user's view of the 
machine) realizing the abstract machine. It is essential 
that the reader understand that the architecture we
will investigate is not proposed as the best possible
architecture realizing the abstract machine of the last
section. Rather, it is put forth only as evidence that
there is at least one reasonably efficient architecture
for the machine. In Section 4 we will discuss how this
architecture can be implemented in VLSI.


The basic structure of the architecture we will 
investigate is that studied in Section 2 (illustrated in 
Figure 2). The flow of instructions and data in the 
machine is exclusively from the input node (at the top 
of the figure) to the output node (at the bottom) -- we 
will not have deletions employ any "backwards flow". 
The machine is based on 16-bit instructions and 32-bit 
data words (which are interpreted either as integers in 
two's-complement or as 32-bit vectors). The top data
paths in the machine (the son links from circles in
Figure 2) are 16 bits wide; the bottom data paths (links
to triangles) are 80 bits wide. The entire machine
operates synchronously; an operation is (perhaps)
performed at each node and data is transmitted from
the node to its son on each major cycle. Having
described the machine at this gross level, we will now
examine the circles, squares and triangles individually.


The primary function of the circle on each major 
cycle is to broadcast what it just received to its sons 
(on the next cycle). In only three contexts must it 
perform a more sophisticated operation. As a new 
element is being inserted, it must decide which way to 
direct the insertion (to one of its nonempty sons) and 
then decrement the appropriate counter by one; it then
ships a "no-op" to the other son. The no-op is
effected by having one bit in the instruction turned off
as the 16-bit instruction is passed to the "other" son.
To accomplish a deletion we insert an instruction
packet of three 16-bit instructions at the root node.
The first instruction is the deletion and the next two
16-bit words contain the binary address of the node to
be deleted. The circles can tell by looking at the
leading bits of the address whether they should
increment one of their counters as they see this
instruction. The final capability the circles must have is
that of passing data to the squares, without
interpreting that as an instruction to them; we will
return to this issue as we discuss the squares.


While the circles have the simplest architectures of 
the three units we will see, the squares have the most 
difficult. The abstract machine requires that the 
squares be able to store data and to perform enough 
calculations to answer queries and perform updates. 
This architecture will accomplish both these tasks by 
shipping combinations of instructions and data to the 
machine. We now have to make a fundamental design
decision: should the individual squares be
special-purpose devices (honed for a particular view of
the tree machine's task), or should they be (in some
limited sense) general-purpose computing devices? We
will choose the latter course, and make each square a 
"baby" von Neumann computer; it is important, however, 
to emphasize that this is merely a design decision and 
not an inherent property of the abstract machine. 


Each square will be a small von Neumann-like 
processor that receives its instructions and data from 
an external, 16-bit stream. An individual processor 
contains sixteen 32-bit words of memory, two 32-bit 
registers, and a vector of eight single-bit data flags. 
The processor also contains an eight-bit Set
identification number (SetID), and an Instruction
Register. The first bit of the Flag vector (Flag[0],
abbreviated F[0]) is used as the "Active" bit of the
processor; a special "Enable" command turns on all
processors, and a processor can conditionally turn itself
off by storing a zero in Flag[0]. The basic layout of the
machine is shown in Figure 4 (notice that because the
machine is rotated 90°, the data flows from right to left
rather than from top to bottom).




Figure 4. Components of the square. 


The 16-bit instruction format for the square
processor is shown in Figure 5. The first bit of an
instruction processed by the squares is always zero; a
one in that bit signifies an instruction that is ignored by
the squares but passed on to the triangles. The two
Fam bits specify one of the four families to which an
instruction can belong (Arithmetic-Logical, Load-Store,
Bit or Special), and the Code gives the opcode of the
instruction. There is a one-bit flag (Bit) in each
instruction, and every other instruction has as its
arguments either two four-bit addresses (A1 and A2),
an 8-bit string (Name) or a five-bit integer (Num). The
actual instructions are described by group in an ISP-like
language in Table 1. All of the arithmetic-logical
instructions are zero-address instructions, combining
registers RA and RB and storing the result in RA. The
load-store instructions specify one of 16 memory
addresses as their operand; the data movement is then
between that address and the register RB. All of the
bit operations have two addresses; they combine the
first and the second operands, storing the result in the
first. The only exception to this pattern is the compare
(comp) operation; it compares RA with RB and stores in
the first bit (A1) whether or not the values are equal
and tells which inequality in the second bit (this is just
a straightforward encoding of three states into two
bits).


Figure 5. Instruction format. 


The only instructions that are not entirely obvious
are the special instructions. The enable instruction
turns on all processors in the tree. The ins (insert)
instruction turns on precisely one processor, turning off
the rest (and decrementing the counters in the circles).
The del (delete) instruction has no effect on the
processors; it only increments the appropriate counters
in the circles (the squares must ignore the two
following instruction packets, though -- they are just
the processor address). The ship instruction allows
data to enter the RB register from the data/instruction
stream. The Flag bit tells whether the next one or two
16-bit packets should be loaded into RB; the data can
then be processed as desired. The chksid and setsid
instructions are for manipulating the 8-bit SetID
register; the former turns off the processor if SetID is
not equal to Num, and the latter loads the SetID field
from Num.


To illustrate the operation of the processors we will 
study two program segments for performing searches. 
The first segment is for member searching. 


chksid ThisSet    // Turn off bad squares
ship Two          // The next two packets
data1             //   contain the
data2             //   comparand
tab               // Put comparand in RA
ldb KeyAd         // Put key in RB
comp 1,2          // Answer is in F[1]


Arithmetic-Logical
  add           RA ← RA + RB
  sub           RA ← RA - RB
  neg           RA ← -RA
  rand          RA ← RA ∧ RB
  ror           RA ← RA ∨ RB
  rxor          RA ← RA ⊕ RB
  rnot          RA ← ¬RA
  shift Num     RA ← RA left shifted by Num
  tab           RA ← RB
  tba           RB ← RA
  swap          RA ↔ RB

Load-Store
  ldb Num       RB ← M[Num]
  stb Num       M[Num] ← RB

Bit
  band A1,A2    F[A1] ← F[A1] ∧ F[A2]
  bor A1,A2     F[A1] ← F[A1] ∨ F[A2]
  bxor A1,A2    F[A1] ← F[A1] ⊕ F[A2]
  bnot A1,A2    F[A1] ← ¬F[A1]
  comp A1,A2    F[A1] ← RA = RB; F[A2] ← RA ≤ RB

Special
  enable        F[0] ← 1
  ins           F[0] ← this processor selected
  del           (defined in text)
  ship Flag     (defined in text)
  chksid Num    F[0] ← SetID = Num
  setsid Num    SetID ← Num

Table 1. Instruction set for squares.


The search key enters the RB register from the data 
stream and is then transferred to the RA register. The 
program then loads the key field of the record into the 


RB register (KeyAd is an integer identifying which of 
the 16 memory words holds the key), and makes the 
comparison. Flag[1] is then one if and only if the 
record's field is equal to the data shipped in the 
stream. At this point, the answer can be combined in 
the triangle network. 


The next program that we will examine arises in 
"nearest neighbor" searching; it computes the distance 
between the data and the key field of the record. 
Since we desire the absolute value of the difference of
the key and the data, we must have a conditional step
in our program.

chksid ThisSet
ship Two          // RA ← Data
data1
data2
tab
ldb KeyAd         // RB ← Key
comp 2,0          // If Data ≤ Key,
swap              //   leave processor on
chksid ThisSet    // Turn all processors on
sub               // RA ← |Key-Data|

The crucial step of this program is the comp instruction:
if Data is greater than Key then a zero is stored in
F[0], which turns the processor off; the swap then
interchanges key and data only in the processors that
are still on. The next instruction
(chksid) turns all appropriate processors back on, and
the subtract correctly computes a positive value. The 
triangles can then be instructed to return the minimum 
of these values. 
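
To check the reasoning behind both program segments, the following sketch
interprets a small subset of the instruction set for a single square. It
is a toy model under several assumptions that the text does not fix: the
ship instruction is given the already assembled 32-bit datum rather than
two 16-bit packets, word width and overflow are not modelled, comp is
taken to store RA ≤ RB in its second flag, and chksid and enable are taken
to execute even in a processor that is turned off (as the nearest neighbor
program requires). All Python names are hypothetical.

```python
class Square:
    """Toy model of one square processor."""

    def __init__(self, set_id, memory):
        self.set_id = set_id
        self.m = list(memory) + [0] * (16 - len(memory))  # sixteen words
        self.ra = self.rb = 0
        self.f = [0] * 8                  # f[0] is the Active bit

    def run(self, program):
        for op, *args in program:
            if op == "enable":
                self.f[0] = 1
            elif op == "chksid":          # assumed to run even when off
                self.f[0] = int(self.set_id == args[0])
            elif not self.f[0]:
                continue                  # a processor that is off skips the rest
            elif op == "ship":            # assumed: datum already assembled
                self.rb = args[0]
            elif op == "tab":
                self.ra = self.rb
            elif op == "ldb":
                self.rb = self.m[args[0]]
            elif op == "swap":
                self.ra, self.rb = self.rb, self.ra
            elif op == "sub":
                self.ra = self.ra - self.rb
            elif op == "comp":
                a1, a2 = args
                self.f[a1] = int(self.ra == self.rb)
                self.f[a2] = int(self.ra <= self.rb)
        return self


def member_search_program(this_set, data, key_ad):
    return [("chksid", this_set), ("ship", data), ("tab",),
            ("ldb", key_ad), ("comp", 1, 2)]


def nearest_neighbor_program(this_set, data, key_ad):
    return [("chksid", this_set), ("ship", data), ("tab",),
            ("ldb", key_ad), ("comp", 2, 0), ("swap",),
            ("chksid", this_set), ("sub",)]


# Key 20 and key 10, both queried with data 17.
print(Square(5, [20]).run(nearest_neighbor_program(5, 17, 0)).ra)
print(Square(5, [10]).run(nearest_neighbor_program(5, 17, 0)).ra)
```

In both cases RA ends up holding the nonnegative difference, whichever of
key and data is larger.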


The two code segments that we have just seen 
illustrate many of the aspects of coding the tree 
machine. Many other examples have been coded, and 
all of them appear to be fairly efficient. More 
quantitatively, the ratio of tree machine instructions to 
"critical" operations in the task clusters very closely 
around 2.5. This statistic is evidence for the 
vindication of our design decision to make the squares 
general-purpose machines, rather than special devices 
tailored to the searching task domain. (Pursuing that 
alternative remains an interesting open problem.) 


Before ending our discussion of the squares, it is 
interesting to compare the design of the processor with 
a more typical von Neumann processor. In some ways, 
we faced exactly the same problems: the choices of 
data representation, instruction formatting, operation 
set, and addressing were all taken from the von 
Neumann design space as discussed by Blaauw and 
Brooks [1979]. On the other hand, we avoided many of 
the issues faced by designers of typical machines; 
these include instruction sequencing, interrupt handling, 
and input/output control. 


Before we discuss the architecture of the triangle,
we must settle one more point about what we want it to
do. In most applications that compute the minimum of a
set (for instance), we want to know not only what the
value of the minimum is but also what element has that
value. We therefore have three objects associated
with computing the minimum: the operation (minimum),
the value, and the name (which is a 32-bit word
associated with the value; its address or "key" in many
applications). When combining two such objects, we
take the value as the minimum of the two values, and
the name from the name of the smaller value. The name
is thus inherited from the minimum. We will also
associate names with other binary operators: the name
of maximum is inherited from the node with greater
value; for plus, from a nonzero element; for or, from a
nonzero bit vector (arbitrary if both are zero); and for
and, from a zero bit vector.
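
The inheritance rule can be written down compactly. The Python sketch below is an illustration only: the packet layout as a (value, name) tuple, the function names, and the choice of the left operand on ties are our assumptions.

```python
from functools import reduce


def combine_min(a, b):
    """Packets are (value, name); the name follows the smaller value."""
    return a if a[0] <= b[0] else b


def combine_max(a, b):
    """The name follows the greater value."""
    return a if a[0] >= b[0] else b


def combine_plus(a, b):
    """The value is the sum; the name comes from a nonzero element."""
    name = a[1] if a[0] != 0 else b[1]
    return (a[0] + b[0], name)


def tree_combine(op, packets):
    """Fold the packets as the triangles would on the way up the tree."""
    return reduce(op, packets)


if __name__ == "__main__":
    packets = [(5, "rec-a"), (2, "rec-b"), (9, "rec-c")]
    print(tree_combine(combine_min, packets))
    print(tree_combine(combine_max, packets))
```

Because minimum and maximum are associative, the left-to-right fold used here yields the same value as a pairwise combination in a binary tree.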


Having defined the concepts of value, name and 
inheritance, it is straightforward to describe the 
architecture of the triangles. They will operate on 
80-bit packets: 16 bits of instruction, and 32 bits each 
of value and name. Computing min, max, plus, and, and 
or are all simple. Union is a bit more detailed, but also 
conceptually straightforward. One aspect that we have 
not mentioned is the interface between the squares 
and the triangles; we must include instructions for 
transferring the contents of the RB register to the 
name or value field of the triangle immediately beneath 
it (these could be included in the load-store family). 
This allows us to give complete programs for answering 
queries. After computing the answers (as illustrated in 
the two segments shown above), we load them into the 
desired fields of the triangles, and combine them as 
desired. 


It is important to emphasize that the architecture we
have just seen is not the architecture that the ultimate
user of the machine will see. Rather, there will be a
hierarchy of functions available to him. At the highest
level, he will be able to perform operations on sets
(load a set, erase a set, for each element in the set,
and so forth); at an intermediate level there are
record-handling operations (defining queries or
inserting, deleting and updating records); and at the
lowest level there are the machine instructions
themselves. At the lowest level the user can make
very efficient code by knowing the details of the
machine; at the higher levels he sacrifices efficiency
for clean and easy code.


An important part of the implementation of this 
architecture is that there be a fairly sophisticated 
device controller for the tree machine (such as an 
off-the-shelf microprocessor). This controller will 
implement the hierarchy of functions mentioned above. 
This will also reduce the bus activity substantially by 
having the controller fetch items from main memory and 
issue instructions to the tree machine; having the CPU 
itself perform these tasks would lead to a substantial 
degradation in overall system performance. 
4. Discussion of Implementation


In this section we will discuss an implementation of 
the architecture of Section 3 in VLSI technology. Just 
as with the architecture, we are not proposing this 
implementation as the best of all possible, but rather as 
one that is reasonably efficient. The fundamental 
description of the implementation is that it is bit-serial. 
There are two motivations for this: one, to exploit the 
shift-register technology of VLSI, and two, to use very 
few pins on packages. 


The implementation of both the circles and the
triangles described in the last section is
straightforward. The squares are also easy to
implement bit-serially. The 16-word memory is in fact a 
parallel shift register, 16 bits wide and 32 bits long. 
The two registers RA and RB are also shift registers. 
To load or store a word, RB and the memory shift 
register are shifted in parallel, and the memory 
controller of Figure 4 is just a multiplexor (decoding a 
4-bit address to one of 16 lines). All of the 
arithmetic-logical operations are accomplished by 
putting a single-bit function box between the RA and RB 
registers, and then shifting the pair through it (all 
operations require at most one bit of memory). Notice 
that we have assumed that the squares have 32 minor 
cycles during each major cycle of the machine. The bit 
operations are straightforward to implement if the Flag 
array is just a small RAM. Estimates by experienced 
VLSI designers indicate that the chip area for the 
functionality in the square is about equal to the chip 
area required for the 512-bit memory. Using current 
technology, it is easy to imagine putting 16 squares on 
a single chip. 
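
As an illustration of a single-bit function box with one bit of memory, the following Python sketch performs addition bit-serially over the 32 minor cycles. It is a model only: shifting the least significant bit first and the name bit_serial_add are our assumptions.

```python
WORD_BITS = 32  # one minor cycle per bit of the word


def bit_serial_add(ra, rb):
    """Add two 32-bit words one bit per minor cycle.

    The only state carried between cycles is the carry bit, which plays
    the role of the single bit of memory in the function box.
    """
    carry = 0
    result = 0
    for cycle in range(WORD_BITS):
        a = (ra >> cycle) & 1            # bit shifted out of RA
        b = (rb >> cycle) & 1            # bit shifted out of RB
        s = a ^ b ^ carry                # sum bit shifted back in
        carry = (a & b) | (carry & (a ^ b))
        result |= s << cycle
    return result                        # result is modulo 2**32


if __name__ == "__main__":
    print(bit_serial_add(5, 7))
```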


Now that we know how we will implement the 
individual processing elements (circles, squares and 
triangles), we must describe how to place them on a 
chip. The first simplification we will make is to consider 
them as standard binary trees rather than the 
"mirrored" binary tree of Figure 1; the unmirroring 
process is illustrated in Figure 6. We now face the 
problem of laying out a binary tree on a chip. This 
problem has been studied at length by Mead and 
Conway [1979], who suggest the space-economical 
layout illustrated in Figure 7. The amount of space 
used in that layout is proportional to the number of 
processors on the chip. Note that each edge in that 
layout is realized by two "wires" on the chip -- one for 
data going to the squares, and one for data coming from 
the squares.


Figure 6. "Unmirroring" the tree machine.


Figure 7. Tree layout on a chip. 


Since only a small number of the processors in a tree 
machine will fit on a single chip, it is important that we 
discuss the packaging of the chips. The packaging 
strategy we propose is illustrated in Figure 8. There 
are two kinds of chips in that figure: the leaf chips and
the internal chips. The leaf chips contain (say) 16 
square nodes and 15 circle and triangle nodes. All the 
communication to a leaf chip is through two wires, so 
the chip needs only two communications pins (besides 
power and timing synchronization pins). Notice that this 
implies that with technological advances in VLSI, we will 
be able to place many more processors on a square 
chip; we are not bound by pin limitations. The internal 
chips would probably be constructed with seven circles 
and triangles on them; this implies that there is one 
input-output pair of wires at the top of the chip and 
eight pairs at the bottom. The total number of pins for 
this chip is therefore eighteen (plus miscellaneous 
pins). This chip is therefore pinbound even in today’s 
fabrication technology; unless there are unexpected 
advances in packaging technology, the internal chips 
will probably continue to have seven or at most fifteen 
pairs of circles and triangles. 


Figure 8. Two kinds of chips (internal chip and leaf chip).


To get a better feeling for the size of the tree
machine, we will briefly consider how one might be built
today. Suppose that we put sixteen square nodes on
each leaf chip, and seven circle-triangle pairs on each
internal chip (both of these are easily accomplished in
today's technology). We will now put 64 leaf chips and
nine internal chips on a board; this gives us 1024
square nodes. We can then put sixteen of these
boards in a small cabinet, giving a tree machine of more
than sixteen thousand square nodes, each holding a
512-bit record. If we assume that technology
continues to double the number of components on a chip
every two years, this implies that we can expect a
tree machine of one million records to fit in about a
cubic foot of space by the end of the 1980's.
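
The packaging arithmetic can be checked with a few lines of Python. This is a sketch under our own assumptions: that the 64 leaf chips on a board are joined by a complete binary tree of 63 circle-triangle pairs, and that each tree edge crossing a chip boundary costs two pins. The constant and function names are ours.

```python
SQUARES_PER_LEAF_CHIP = 16
PAIRS_PER_INTERNAL_CHIP = 7
LEAF_CHIPS_PER_BOARD = 64
BOARDS_PER_CABINET = 16
BITS_PER_RECORD = 512


def board_summary():
    """Return (square nodes per board, internal chips per board)."""
    squares = LEAF_CHIPS_PER_BOARD * SQUARES_PER_LEAF_CHIP
    # A binary tree over 64 leaf chips needs 63 internal nodes (assumed).
    internal_nodes = LEAF_CHIPS_PER_BOARD - 1
    internal_chips = -(-internal_nodes // PAIRS_PER_INTERNAL_CHIP)  # ceiling
    return squares, internal_chips


def internal_chip_pins(pairs_on_chip):
    """Communication pins for a chip holding a complete subtree.

    A subtree of k nodes has one edge going up and k + 1 edges going
    down; each edge is a pair of wires.
    """
    return 2 * (1 + (pairs_on_chip + 1))


def cabinet_squares():
    return BOARDS_PER_CABINET * board_summary()[0]


if __name__ == "__main__":
    print(board_summary(), internal_chip_pins(7), cabinet_squares())
```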


These rough (but fairly conservative) estimates
indicate that the tree machine might be one reasonable
way to exploit the processing power that VLSI will give
us. Before we can assert this with confidence,
however, we must show that the tree machine is a
wiser way to invest resources than other structures for
searching. For example, might it be better to put the
same resources into a large RAM memory rather than a
tree machine? The fact that the functionality of the
tree machine is about equal to the chip resources
required by the memory shows that if we were to go
with a RAM instead of a tree machine, we could get at
most twice the memory (and simultaneously lose a lot of
functionality). It therefore appears that for searching
applications, the tree machine is better suited than the
RAM. The detailed comparison of this architecture to
its competitors remains an open problem.


5. Conclusions 


In this paper we have investigated the tree machine 
for searching problems on several levels. In Section 2 
we studied it in an abstract setting and showed that it 
can rapidly solve many searching problems, as well as 
some other problems that do not immediately appear to 
be searching problems. In Section 3 we saw an 
architecture (that is, a user's view) of the machine, and 
in Section 4 we saw that that architecture can be 
efficiently implemented in VLSI technology. Having 
studied the machine at these various levels, we will 
now spend a few moments summarizing the
contributions of this work. 


This machine can be compared with many other
architectures. It is similar to an associative memory in
many aspects, but it can perform many more operations
than even the most powerful associative memories
considered to date (see, for example, Lamb and
Vanderslice [1979]). One might consider the square
processors as forming a Single-Instruction,
Multiple-Data stream (SIMD) computer, but each square
is considerably simpler than most SIMD machines
proposed to date. The tree machine is also
superficially similar to the CASSM computer of Su et al.
[1979], but there are fundamental differences in the
two machines at both the architectural and
implementation levels. Two other machines to which it
might be compared are the tree-structured machines of
Mago [1979] and Sequin, Despain and Patterson
[1978]. Both of these machines, however, are put
forward as general-purpose computing devices, while
our machine is much more specialized to the particular
problem of searching.


An interesting aspect of the tree machine is what we
might call its "computational structure", which is
illustrated in Figure 9. That diagram has three
interpretations. First, it illustrates the tree machine
itself: very small input and output channels, with
massive computation going on in between. Second, it
describes the searching problem: a small question is
asked about a large set, giving a small answer. And
finally, the figure illustrates the constraints of working
with pinbound VLSI: the number of pins on a chip is
very small compared to the number of functional
components. The fact that the abstract structure of
both the searching problem and the tree machine's
solution to it closely model the medium of VLSI indicates
that this approach might be very successful.


Figure 9. A computational structure. 


To summarize the tree machine, the authors feel that
this work has three contributions. The first is the
abstract tree machine: it gives a number of nice
"theoretical" solutions to a large set of problems. The
second contribution is the architecture and
implementation we have proposed; they indicate that
this machine might be a reasonable device to build as
further advances in VLSI technology occur. Finally, we
feel that the "computational structure" we just
investigated provides an example of the kind of
argument that will justify special-purpose architectures
proposed for implementation in VLSI.
Summary


Binary decision trees can be found in several
contexts. A logic example is a demultiplexer; a
software example is a binary search tree. We show
that the evaluation of any decision tree of size n
can be sped up to the Boolean complexity lower
bound $\lceil \log n \rceil$.


A decision tree consists of a set of binary 
decision variables input, a set of result vari- 
ables output, and a binary tree. If the decision 
tree is of size n there are n result variables, 
n-1 decision variables, and the binary tree has n 
leaf nodes. Each result variable corresponds to 
a leaf node. Each decision variable corresponds 
to an internal node so that the left arc incident 
from the internal node is labeled by the decision 
variable, the right arc by its complement. Thus, 
a decision tree is similar to a nested IF...THEN 
.. ELSE... program. 


A decision tree will be denoted by an upper
case italic letter: A, B, .... A Boolean variable
is denoted by a lower case italic letter:
a, b, c, ....


We wish to speed up the serial evaluation of 
decision trees. A serial evaluation of a decision 
tree is a traversal of the tree from its root to 
one of its leaves. When a node is visited, its 
decision variable is evaluated. If it is true, 
the left successor node is visited; otherwise, the 
right successor is visited. Only the result vari- 
able of the leaf visited is set true; the others 
are set false. 
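
Serial evaluation can be sketched in a few lines of Python. The tree below is our own reading of the "dave", "david", "dan" example discussed next; its exact shape, the tuple representation, and the leaf labels (including the hypothetical "miss" leaves) are assumptions made for illustration.

```python
# An internal node is (decision variable, left subtree, right subtree);
# a leaf is a label naming its result variable.  Shape and labels assumed.
EXAMPLE_TREE = (
    "d1",
    ("a2",
     ("v3",
      ("e4", "dave", ("i4", ("d5", "david", "miss5"), "miss4")),
      ("n3", "dan", "miss3")),
     "miss2"),
    "miss1",
)


def evaluate_serial(tree, values):
    """Walk from the root to one leaf; return (leaf label, steps taken)."""
    node, steps = tree, 0
    while isinstance(node, tuple):
        variable, left, right = node
        node = left if values[variable] else right   # true goes left
        steps += 1
    return node, steps


if __name__ == "__main__":
    david = {"d1": True, "a2": True, "v3": True, "e4": False,
             "i4": True, "d5": True, "n3": False}
    print(evaluate_serial(EXAMPLE_TREE, david))
```

Only the leaf returned has its result variable set true; all other result variables are false.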


For example, Fig. 1 shows a decision tree to
search for "dave", "david", or "dan". The decision
variables are denoted $c_i$, meaning the $i$th
character of the pattern is "c". The result variables
are denoted $r_1$ through $r_8$. Leaf 1 is
reached if all of the decision variables $d_1$,
$a_2$, $v_3$, and $e_4$ are true. Then, result
variable $r_1$ is true; results $r_2$ through $r_8$
are false. Similarly, the result variables for "david"
and "dan" are true for
$d_1 \cdot a_2 \cdot v_3 \cdot \bar{e}_4 \cdot i_4 \cdot d_5$
and $d_1 \cdot a_2 \cdot \bar{v}_3 \cdot n_3$,
respectively. If none of the patterns
is matched, one of the other result variables will
be true.


* This work was supported in part by the National


For certain decision trees, serial evaluation
may take n-1 steps. If the tree has the special
form in which there is a chain of n-1 internal
nodes, the prefix computation [1] can be used to
evaluate the tree in $\lceil \log n \rceil$ time. For arbitrary
decision trees, we propose a tree-height reduction
transformation which also yields a $\lceil \log n \rceil$
time bound. This transformation is similar to
distribution applied to arithmetic expressions.


Definition: Consider a binary tree T containing a
subtree A. The arc path from the root of T to the
root of A is called the factor of A. The factor
of a decision tree is the product of the Boolean
variables or complements labeling the arc path.
Let p denote the factor of A in T.


Definition: If a decision tree A is conditioned
by a Boolean variable p, the values of all of the
result variables of A are false if p is false;
otherwise, the values of the result variables are
determined by evaluating A.


Definition: Let the decision tree T have a subtree
A. By distributing A from T with remainder
R, we mean: (1) replace the subtree A in T by a
leaf node; call this tree R; (2) condition A by
the factor of A, p.


After distribution, each of p, A, and R is
smaller than T and they can all be evaluated in
parallel, resulting in a considerable speedup.
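
The distribution step can be checked mechanically. The Python sketch below splits a tree into factor, subtree A, and remainder R, evaluates the three parts independently, and conditions A by the factor. The tree shape, the tuple representation, the "HOLE" placeholder leaf, and all function names are our assumptions; the sketch models the logic only, not the gate network.

```python
from itertools import product

# (variable, left, right) for internal nodes; strings for leaves (assumed shape).
TREE = ("d1",
        ("a2",
         ("v3",
          ("e4", "dave", ("i4", ("d5", "david", "miss5"), "miss4")),
          ("n3", "dan", "miss3")),
         "miss2"),
        "miss1")
VARIABLES = ["d1", "a2", "v3", "e4", "i4", "d5", "n3"]


def leaves(tree):
    if not isinstance(tree, tuple):
        return [tree]
    return leaves(tree[1]) + leaves(tree[2])


def one_hot(tree, values):
    """Serial evaluation: exactly one result variable is true."""
    node = tree
    while isinstance(node, tuple):
        variable, left, right = node
        node = left if values[variable] else right
    return {leaf: leaf == node for leaf in leaves(tree)}


def distribute(tree, path):
    """Return (factor literals, subtree A, remainder R) for a root-to-A path."""
    if not path:
        return [], tree, "HOLE"          # A is replaced by a leaf node in R
    variable, left, right = tree
    if path[0] == "L":
        factor, a, rest = distribute(left, path[1:])
        return [(variable, True)] + factor, a, (variable, rest, right)
    factor, a, rest = distribute(right, path[1:])
    return [(variable, False)] + factor, a, (variable, left, rest)


def evaluate_distributed(tree, path, values):
    factor, a, remainder = distribute(tree, path)
    p = all(values[v] == wanted for v, wanted in factor)   # evaluate the factor
    results = {leaf: p and hit for leaf, hit in one_hot(a, values).items()}
    rest = one_hot(remainder, values)
    rest.pop("HOLE")                     # the placeholder is not a result
    results.update(rest)
    return results


if __name__ == "__main__":
    ok = all(
        evaluate_distributed(TREE, ["L", "L", "L"], dict(zip(VARIABLES, bits)))
        == one_hot(TREE, dict(zip(VARIABLES, bits)))
        for bits in product([False, True], repeat=len(VARIABLES))
    )
    print("distribution preserves all results:", ok)
```

Here the path L, L, L selects the subtree rooted at e4, so the factor consists of the literals d1, a2, and v3.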


Fig. 2 shows an example. Let A be the subtree
of Fig. 1 rooted at $e_4$. The factor of A is
$p = d_1 \cdot a_2 \cdot v_3$; that is, a result of A can be true
only if $d_1 \cdot a_2 \cdot v_3$ is true. To condition A by p, we
compute $d_1 \cdot a_2 \cdot v_3$ and AND the result with each
result variable of A as shown. It is easy to verify
that a result variable of A is true in Fig. 2
only when it is true in Fig. 1.


It may take 6 steps to evaluate Fig. 1, but
it takes at most 4 steps to evaluate Fig. 2.
$d_1 \cdot a_2 \cdot v_3$ may be evaluated in 2 steps in parallel with
the evaluation of A, which takes at most 3 steps.
Conditioning A takes only 1 more step. R can be
evaluated in at most 4 steps in parallel with the
evaluation and conditioning of A. The evaluation
time of Fig. 1 can be decreased even further to
$\lceil \log 8 \rceil = 3$ steps by recursively distributing A
and R.

The proof that a logarithmic speedup results
by recursive application of distribution is based
on two lemmas similar to those in [2]. The first
bounds the size of the tree that can be distributed
out (and also the size of the remainder tree).
The second bounds the size of the factor.

Lemma 1: Any binary tree, T, of size n (n leaves)
where n > 4 contains a subtree, A, such that
$n/4 < |A| \le n/2$.

Lemma 2: Any binary tree, T, has a subtree, A,
satisfying Lemma 1 such that the factor, p, from
the root of T to the root of A contains at most
$\lceil n/2 \rceil$ arcs.


Lemmas 1 and 2 insure that on each application
of distribution there is a linear reduction
in the size of: the factor, the tree distributed
out, and the remainder tree. Thus, we can prove
the following:


Theorem: Any decision tree of size n, n > 4, can
be evaluated in $\lceil \log n \rceil$ gate delays with fewer
than 5/4 n log n AND gates.


The gate bound of this theorem is more difficult
to establish. It follows from the solution of the
gate recurrence equation

$$g(T) = g(p) + g(A) + g(R).$$

The function g(T) may be bounded by the real valued
function $\phi(n)$ which satisfies

$$\phi(n) = \max_{n/4 < m \le n/2} \bigl( n + \phi(m) + \phi(n-m) \bigr)$$

where n is the size of T and m represents the size
of A. This equation is similar to the information
entropy function whose solution is known.


Although the time bound is low, there are
decision trees whose time bound is lower
(O(log log n)). That the gate bound is nearly
linear for large n may be surprising.


The concept demonstrated above has been
studied with regard to implementation and application
[3]. Implementation issues investigated include
limited fan-out (time: 2 log n, gates:
2n log n) and a programmable evaluation network
(time: 3 log n, gates: O(n log n)). Applications
that have been considered are: decision tables,
numerical programs containing more conditionals
than computations, and searching. We have also
investigated the possibility of applying other
tree-height reduction transformations to decision
trees.


Fig. 1. A decision tree

Fig. 2. A distributed decision tree


References

[1] R. Ladner and M. Fischer, "Parallel Prefix
Computation," Proceedings of the 1977 International
Conference on Parallel Processing
(Aug., 1977), pp. 218-223.

[2] R. Brent, D. Kuck, and K. Maruyama, "The
Parallel Evaluation of Arithmetic Expressions
without Division," IEEE Transactions on
Computers, (May, 1973), pp. 532-534.

[3] R. Kuhn, Ph.D. thesis, in preparation, Dept.
of Computer Science, University of Illinois
at Urbana-Champaign (1979).
Abstract -- To study the fault-diagnosis method
for the class of multistage interconnection
networks a general fault model is first constructed.
Specific steps for diagnosing single faults
and detecting multiple faults in the interconnection
networks such as the indirect binary n-cube
network and the flip network are then developed.
The following results are derived in this
study: (1) independent of the network size, only
four tests are required for detecting a single
fault; (2) the number of tests required for
locating a single fault and determining the fault
type ranges from 4 to $\max(12, 6 + 2\lceil \log_2(\log_2 N) \rceil)$
except for four types of single faults in the
switching elements which cannot be pinpointed at
the switching element level; (3) only four tests
are required for locating a single fault if the
switching element is designed in such a way that
any physical defect in the switching element
causes both outputs of the related switching
element to be faulty; and (4) multiple faults
can be detected by $2(1 + \log_2 N)$ tests.


I. Introduction 


The problem of fault-diagnosis in intercon- 
nection networks has received little attention in 
the literature. In this paper we investigate the 
fault-detection and the fault-location problems 
of a class of multistage interconnection networks 
[1]. The class of multistage interconnection net- 
works includes the modified data manipulator [2], 
the flip network [3], the indirect binary n-cube
network [4], the omega network [5], and the 
baseline network. Since these networks are topo- 
logically equivalent, the fault-diagnosis scheme 
developed for one network can also be used for 
other networks after applying mapping rules des- 
cribed in [1]. In this paper, we use the base- 
line network in developing the scheme. 


The fault-diagnosis problem is approached by 
generating suitable fault-detection and fault- 
location test sets for every fault in the assumed 
fault model. The test sets are then trimmed to 
minimum or nearly minimal sets. In Section II 
we propose a fault model of a switching element 
and derive a test set for every fault in the
fault model. The objective of Sections III and 
IV is to develop a specific fault-diagnosis 
scheme for the network constructed of switching 
elements having direct- and crossed-connection 
capabilities as shown in Fig. 1. The fault-diag- 
nosis of single faults and the fault characteris- 
tics are discussed in Section III. The multiple 
fault detection problem is then considered in
Section IV. The proof details of theorems are
omitted because of the page limitation.
Interested readers are referred to [6].


II. Fault Model and Test Set of
a Switching Element


A. Fault Model


Generally, a switching element with two
input lines and two output lines can be considered
as a 2 X 2 crosspoint switching matrix
which may have as many as 16 states. Table 1
shows the set S of the 16 states and the related
symbolic representation. In our proposed multistage
interconnection network, a switching element
is designed so that only some of the 16
states are used. We denote these states as valid
states. In the flip network and the indirect
binary n-cube network, the valid states include
$S_{10}$ and $S_5$. The valid states in the omega network
include $S_3$, $S_5$, $S_{10}$, and $S_{12}$. The number
of valid states which a switching element can
assume in order to achieve the network function
depends on the physical design of the interconnection
network. A fault exists when a switching
element is in any one of the 16 states
different from a given valid state. A fault in
an interconnection network can be located either
at a link or in a switching element. All discussion
in this paper is confined to solid
logical faults. The fault located at a link can
be considered to be one of the line stuck types,
i.e., stuck-at-zero (s-a-0) or stuck-at-one
(s-a-1). We use a functional approach to consider
fault types in a switching element. For
a switching element with n valid states, there are
$16^n$ possible state combinations in which the
faulty switching element can behave. We use the
ordered set $\{(s_1, s_2, \ldots, s_n) \mid s_i \in S, 1 \le i \le n\}$
to describe the state combinations and name each
of the state combinations a functional state.

As an example, Fig. 1 shows a switching element
with two valid states, $S_{10}$ and $S_5$. Assume the
first valid state is $S_{10}$ and the second $S_5$. The
functional states of the switching element can be
expressed by a functional state set which is an
inner product set of S, S X S. There are 256
functional states. The state combination
($S_{10}$, $S_5$) is the normal functional state and the
other 255 state combinations are faulty functional
states of the switching element shown in
Fig. 1.
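
The count of functional states can be reproduced by enumeration. The following Python sketch is illustrative only; representing each state by its index 0 to 15 and the function name functional_states are our assumptions.

```python
from itertools import product

STATES = range(16)          # indices of the 16 states of a 2 x 2 element


def functional_states(valid_states):
    """All ordered combinations (s_1, ..., s_n), one entry per valid state."""
    return list(product(STATES, repeat=len(valid_states)))


if __name__ == "__main__":
    valid = (10, 5)                         # the two valid states of Fig. 1
    combos = functional_states(valid)
    faulty = [fs for fs in combos if fs != valid]
    print(len(combos), len(faulty))
```

For an element with four valid states, as in the omega network, the same enumeration gives $16^4 = 65536$ functional states.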


B. Test Set 


A test set should be developed for detecting
every fault in the fault model described above.
By test set, we mean the collection of tests
which are needed for detecting all possible
faults in the fault model. Faults to be detected
and tests for detecting them are listed in Table
2 for a switching element in valid state $S_{10}$. In
Table 2 the detection of the link stuck fault is
described in Part I. The superscripts of the
link labels indicate whether the fault causes the
link to be stuck at 0 or 1. For example, in the first
row, we can see that if we apply an input
$(x_1, x_2) = (1,0)$ to the switching element in valid state
$S_{10}$, the normal output will be $(y_1, y_2) = (1,0)$ and
the fault, $x_1^0$ or $y_1^0$, of $x_1$ or $y_1$ sticking at 0
will cause the output to be $(y_1, y_2) = (0,0)$. We then
say that the input $(x_1, x_2) = (1,0)$ can detect the
fault, $x_1^0$ or $y_1^0$, of the switching element in
valid state $S_{10}$. The detection of switching element
faults is described in Part II. For a
switching element stuck in $S_5$, $S_{10} \to S_5$ (see row
$S_{10} \to S_5$ of Part II of Table 2), if we apply input
$(x_1, x_2) = (0,1)$, the faulty output will be
$(y_1, y_2) = (1,0)$ which is different from the
normal output $(y_1, y_2) = (0,1)$. In Table 2, "-"
means logically undefined output and "φ" means
logically erroneous output where 0 and 1 are the
simultaneous inputs. The output values of
"-" and "φ" depend on the circuit implementation
of the switching element. However,
an arbitrary assignment of 0 or 1 to "-" or "φ"
would not affect the differentiation between the
normal output and the faulty output. Hence
whether we can easily design equipment to
detect "-" and "φ" would not disturb our development
of the test set for various faults.


From Table 2 it can be seen that only two
tests, $(x_1, x_2) = (0,1)$ and $(x_1, x_2) = (1,0)$, are
needed to detect all faults. The test vectors on
$x_1$ and $x_2$ are 01 and 10, respectively. For
easy reference we define $t = 01$ and $\bar{t} = 10$. The
same conclusion can be drawn for a switching element
in valid state $S_5$. The faults, the test inputs
and the test outputs of $S_5$ are shown in
Table 3. Similarly, the above techniques can
also be extended to detect faulty elements in
other networks such as the omega network with
additional valid states $S_3$ and $S_{12}$.


III. Diagnosis of Single Faults 


A. Fault Detection 


An algorithm for deriving efficient test
sets for the network will be presented. The
basic idea is to establish connection paths and
to label each link in a path with $t$ (or $\bar{t}$) such
that each switching element in the network has
its two input lines labelled with the two test
vectors ($t$ and $\bar{t}$), respectively. The connection
paths are established by putting switching elements
into a valid state to be tested. Since
only one valid state of the switching element can
be tested in a test phase, two test phases are
needed for the switching element shown in Fig. 1.
In phase 1, we test the valid state shown in Fig.
1(a) for all switching elements in the network
and in phase 2, we test the valid state shown in
Fig. 1(b). It is efficient if
each one of the switching elements has its two
input lines labelled with two different test
vectors in each test phase. These test vectors
appearing on the input lines in each phase constitute
a test set which can efficiently test all
switching elements in the network. An algorithm
for generating such an efficient test set for a
network of size N is described as follows:


Step 1: Label the top terminal link in the left
        side of the network with test vector
        $t = 01$.

Step 2: Assume the labelled terminal links are
        named 0, 1, ..., and m-1 from top to bottom
        and the next m unlabelled terminal
        links are named, from top to bottom,
        m, m+1, ..., and 2m-1, where $1 \le m < N$
        and m is a power of 2. Label terminal
        link m+i with $\bar{L}(i)$ for $0 \le i \le m-1$,
        where $L(i)$ is the test vector assigned
        to terminal link i and $\bar{L}(i)$ is the
        complement.

Step 3: For the unlabelled terminal links in the
        left side, repeat Step 2 until all N
        terminals are labelled.
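
The three steps translate directly into a short routine. The Python sketch below is one possible rendering; the function name label_terminals and the representation of test vectors as the strings "01" and "10" are our assumptions.

```python
def label_terminals(n):
    """Test-vector labels for the n left-side terminal links, top to bottom.

    n is the network size N and must be a power of two.
    """
    if n < 1 or n & (n - 1):
        raise ValueError("network size must be a power of two")
    complement = {"01": "10", "10": "01"}
    labels = ["01"]                       # Step 1: top link gets t = 01
    while len(labels) < n:                # Step 3: repeat until all labelled
        # Step 2: link m+i receives the complement of the label of link i
        labels += [complement[v] for v in labels]
    return labels


if __name__ == "__main__":
    print(label_terminals(8))
```

If the leftmost stage pairs terminal links 2k and 2k+1 on one switching element (an assumption about the wiring), every element of that stage receives the two complementary test vectors.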


The test set generated by the above algorithm is
good for both test phases and an example is given
in Fig. 2. The labels on the input lines of the
leftmost stage correspond to the required test
vectors while the other labels indicate the
fault-free response of the network to these test
vectors. The fault-free response of the network
shown in Fig. 2 assures that each one of the
switching elements has its two input lines
labelled with two different test vectors. These
two different test vectors are exactly the test
vectors needed to test each valid state of a
switching element (see Section II). It can be
seen that to detect single faults in a network
four tests (two for each test phase) are necessary
and sufficient and the test length is
independent of the network size. This fact is
restated in Theorem 1.


Theorem 1:


Four tests are necessary and sufficient for
detecting single faults in a baseline network
constructed of switching elements with two valid
states shown in Fig. 2.


B. Fault Location 
The problem of locating a single fault can
be partitioned into two subproblems: one is to
locate the stuck fault at a link and the other is
to locate the fault in a switching element.


1. Link stuck fault: A link stuck fault 
can be located by intersecting the link sets of 
two faulty paths. A stuck fault at a link can 
cause one and only one faulty output at an obser- 
vable terminal in each test phase. Each faulty 
output should be equal to either 00 or 11 which 
is different from the normal output. Each link 
in the network can uniquely be identified by two 
paths, one from phase 1 and another from phase 2. 
The method to compute the link set in a path is 
previously defined [1]. The test set derived 
for detecting single faults can be used to locate 
a single stuck fault at a link. However, the 
link stuck fault may not be distinguishable from 
some switching element faults which will be dis- 
cussed later. In spite of this indistinguishabi- 
lity, there exists a one-to-one relationship bet- 
ween a link stuck fault and a faulty output 
pattern. 


Theorem 2: 


There is a one-to-one correspondence between
the link stuck fault and the faulty output pattern.
The faulty output pattern is a necessary
condition of the link stuck fault.


that P = BeBe el ean geragae is the set of the
faulty states which have one faulty output in valid
state $S_{10}$. Thus, if the functional state of a
switching element is one of the following:
(S aa (S Se)s (Sg.S.)5 (S, 4255) > Sige ys
and ( 4255 ,» we have only one faulty output at
phase 1 test and no faulty output at phase 2 test.
Similarly, according to Table 3 we can find that
Q = {S,,8 18,0858) ,S,,} is the set of the
faulty states which have one faulty output in valid
state $S_5$, and any one of the following functional
states: (S S.), (Ss Sa)y. CS So )-« So .48s)
(Se 23 Sex 9's and S (gt?) results in one faulty
output at the phase 2 test and no faulty output
at the phase 1 test. Therefore, there are 12
fault types of the one-response fault.


Theorem 3: 


The fault location and the fault type of the
one-response fault can be determined by at most
$6 + 2\lceil \log_2(\log_2 N) \rceil$ or $6 + 2\lfloor \log_2(\log_2 N) \rfloor$ tests.


The number of tests indicated in Theorem 3 
is actually an upper bound of the number of tests 
for determining the fault location and the 
fault type at the switching element level. In 


some cases we may only need to locate the fault at 
the module level. Then the number of tests needed 
is less than this upper bound depending on the 
size of the modules.


Example: Comparing the test outputs to the
fault free output of a network of size 16, we observe
that the path of link set {7,6,2,0,1} in the
phase 2 test lead to the faulty output 00 of output
link 6 at the phase 1 test and output link 1
at the phase 2 test. Intersecting the two link
sets, we can locate the fault at link 6 of level
1 which is stuck at zero.


2. Switching element fault: A switching element
fault can be the result of any one of the 16
states shown in Table 1. Single switching element
faults in a network can result in several faulty
output patterns. In terms of the response patterns
of the detection phases the faults can be
classified into four cases as follows:


(1) One-response fault - There is only one
    faulty output. This faulty output can
    be a terminal output at either phase 1
    test or phase 2 test;
(2) Separated two-response fault - There are
    two faulty outputs. One of them is a
    terminal output at phase 1 test and the
    other is a terminal output at phase 2
    test;
(3) Nonseparated two-response fault - There
    are two faulty outputs. But, both of
    the faulty outputs are terminal outputs
    at either phase 1 test or phase 2 test;
(4) Multiple-response fault - There are
    more than two faulty outputs.

Each case will be considered in the following
paragraphs.


Case 2: In this case the faulty switching
element has only one faulty output in each valid
state. We have described in Case 1 that
P = LSS 495545 ny Seas s uy and Q = 1S. 48.58 .49595. 55 S
are the sets of faulty states which have
only one faulty output in valid states $S_{10}$ and $S_5$,
respectively. The possible faulty output combinations
of these two sets are depicted in Table
4 in which there are six subcases: A, B, C, D, E,
and F. There are 36 functional states in this
case, which form the inner product set of P and
Q, P X Q. These 36 functional states are
classified into six subcases according to the
faulty output patterns as shown in Table 4. The
classification is shown in Table 5 in which the
horizontal caption is for the faulty states of
valid state $S_5$ and the vertical for the faulty
states of valid state $S_{10}$. An example of reading
Table 5 is described below. Suppose (S455, 3) is
a faulty functional state of ($S_{10}$, $S_5$). The B at
the intersection of column S.,. and row S$ in
Table 5 implies that the switching element in
functional state (S.,,S..,.) will result in a
faulty output of a binary vector in a test phase
and a φφ faulty output in another test phase
according to Table 4. In each subcase it can be
seen that in some examples there are two common
switching elements in the two faulty paths. One
of these two common switching elements is faulty, 
Case 1: The set of switching elements in- Additional test sets should be derived in order 


to locate the fault within these two questionable 
switching elements, In some examples there is 
only one common switching element in the two 
faulty paths, In these examples the common 
swithing element is either in the rightmost 

stage or in the leftmost stage, 


volved in a faulty path is not sufficient to lo- 
cate a single fault at the switching element level 
since we have to pinpoint exactly one switching 
element in this set. Additional tests will be ne- 
cessary to determine the fault location and the 
fault type. According to Table 2 we find the 
Theorem 4:


The fault location and the fault type of the Subcase A fault can be determined by at most 8 tests, independent of the network size. The 8 tests include 4 tests for the detection phase and 4 tests for locating the fault.


Theorem 5: 


The fault location and the fault type of the 
Subcase B fault or the Subcase C fault can be 
determined by at most 10 tests, independent of 
network size. 


The number of tests needed for locating the faults of Subcases B and C is equal to 4 or 8, and two additional tests are needed for determining the fault type.


Theorem 6: 


The fault location and the fault type of the 
Subcase D fault or the Subcase E fault can be 
determined by at most 12 tests, independent of 
network size. 


Actually, the total number of tests needed 
to determine the fault location and the fault 
type in Subcases D and E is equal to 8 or 12. 


Remark: The fault of Subcase F cannot be 
pinpointed at the single switching element level 
and it is indistinguishable from a link stuck 
fault. 

Because of the characteristics of the "--" fault in Subcase F, it is impossible to further pinpoint the fault among each questionable switching element and link by applying tests on the input side and observing output on the output side. Hence there exists an ambiguity between the link stuck fault and the Subcase F fault.


Case 3 and Case 4: We can compute the switching element sets of the faulty paths and the intersections of these sets should lead to a unique faulty switching element. In these two cases, only four tests which are developed for the detection phases are required for locating the fault. In Case 3, the faulty switching element has two faulty outputs at one of the test phases. There are 18 fault types in Case 3,
which are inner products es - 18598, 28, 5,5 


a 55,.S } x and {Ss S oF 8 Ss 


b 
oy Ce Since 192 ete 
ye ing bia can be uniquely identified by 


the switching element set of the two faulty paths 
in the same detection phase, the number of tests 
needed is equal to four, Two additional tests 
may be needed to differentiate $66 and --. In 
Case 4, there are 189 fault types and the fault 
type can be any one in the union of the inner 
eats sets of . 8), $2555, 5 Sc x 

S-S.} and {S- Ss) 8, Ss, 2S, 65 S 55.946 
where {S-S,_} is 10). set a the “Beatess sb a 


Table 1 excluding 5S, and {S-S, 4} the set excluding 


7? ie ie } 


S.,- Again the faulty switching element can be 
uniquely identified by the faulty switching 
element sets of the faulty paths in the detection 
phases. Furthermore, four, two or zero addi- 
tional tests may be needed to differentiate $0 
and --. 


The above discussion can be summarized in 
the following theorem. 


Theorem 7: 


The fault location and the fault type of Case 3 or Case 4 can be determined by at most 8 tests, independent of network size.


The single fault diagnosis scheme is good 
under the assumption that the diagnosis procedure 
can be repeated in a reasonably short period 
during which at most a single fault could 
possibly occur. However, it is well known that 
many physical faults of a single logical circuit 
component cannot be represented as a single 
fault. 


IV. Detection of Multiple Faults 


Now we consider the detection problem for multiple faults. By a multiple fault, we mean the simultaneous occurrence of any possible combination of single faults.

In the single-fault detection problem, we derive tests for every stuck-type fault at the link and functional state fault in the switching element. For the multiple fault case, the test set derived for detecting single faults may fail to indicate the existence of the fault because some faults may be masked by some other faults. The faulty state which can mask a fault such that the fault becomes unobservable is called the masking faulty state. In valid state
S1 , the masking faulty states are S_, S.,,, and 
S,, and in valid state S,, the masking faulty 
states are S 5 and The masking problem 
can be solved by using distinctive test vectors. Extending the solution to the whole network, we should use N distinctive test vectors for N terminals. The all-zero and all-one vectors should be excluded because these two vectors fail to test stuck-type faults at links. Hence, $1+\log_2 N$ binary bits are needed to form the test vectors for the multiple fault. Two test phases, similar to the two for detecting single faults, are also needed for detecting multiple faults. Concluding the above discussion we have the following theorem.


Theorem 8:


The number of tests for detecting multiple faults is equal to $2(1+\log_2 N)$.
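
The counting argument behind Theorem 8 can be sketched as follows. This is only an illustration under stated assumptions: N is taken to be a power of two, the function names are hypothetical, and the particular choice of vectors (the binary codes of 1 through N) is one convenient way to obtain N distinct vectors that avoid the all-zero and all-one patterns, not necessarily the assignment used by the authors.

```python
def multiple_fault_test_vectors(n_terminals):
    """Return N distinct binary vectors of 1 + log2(N) bits each.

    Assumption: N is a power of two. The codes 1..N are used, so the
    all-zero vector (code 0) and the all-one vector (code 2N - 1) never occur.
    """
    if n_terminals < 2 or n_terminals & (n_terminals - 1):
        raise ValueError("N is assumed to be a power of two, N >= 2")
    width = n_terminals.bit_length()  # equals 1 + log2(N) for a power of two
    return [format(code, "0{}b".format(width)) for code in range(1, n_terminals + 1)]


def multiple_fault_test_count(n_terminals):
    """Number of tests 2 * (1 + log2(N)): one per bit position, in two phases."""
    if n_terminals < 2 or n_terminals & (n_terminals - 1):
        raise ValueError("N is assumed to be a power of two, N >= 2")
    return 2 * n_terminals.bit_length()


if __name__ == "__main__":
    for vector in multiple_fault_test_vectors(8):
        print(vector)
    print(multiple_fault_test_count(8))
```

Reading each bit position as one test applied in each of the two phases is an interpretation of the text above; it reproduces the stated total of $2(1+\log_2 N)$.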
V. Conclusion


In this paper, we have presented a fault 
model for the network in the class of multistage 
interconnection networks. Fault diagnosis proce- 
dures for the network constructed of switching 
elements with two valid states have been con- 
sidered. A diagnosis method for single faults 
and a detection method for multiple faults are 
developed. In the diagnosis procedures the con- 
trol lines of the switching elements in the same 
stage can be grouped together and activated by
the same control signal. The control line 
grouping of each stage is exactly the control 
scheme used in the flip network of STARAN [3]. 
Hence, the diagnosis procedures developed in this 
paper are good both for the indirect binary n- 
cube network and the flip network. Extension to 
the network constructed of switching elements 
with four valid states is feasible since the test 
sets for faults in switching elements with four 
valid states are the same as those we developed 
for switching elements with two valid states. 

The problem left is to design diagnosis proce- 
dures with minimal or nearly minimal number of 
tests. 


The number of tests which is required under various conditions in the diagnosis procedures developed in this paper is summarized as follows. The number of tests for detecting single faults is equal to four and is independent of the network size. The number of tests for detecting multiple faults is equal to $2(1+\log_2 N)$ where N is the number of terminal links in one side of the network. The number of tests needed for determining the fault location and the fault type of a single fault depends on the fault type and/or the size of the network. The characteristics of single switching element faults are summarized in Table 6. The minimum number of tests needed for determining the fault location and the fault type is equal to four and the maximum is $\max(12,\ 6+2\lceil \log_2(\log_2 N)\rceil)$. For a network size N = 1024 the maximum is equal to 14. There exist four switching element faults (Subcase F) which cannot be pinpointed at the single switching element level and those four are not distinguishable from the link stuck fault. This study provides specific information of fault characteristics for designing an easily diagnosable network.


Fig. 1 A switching element.
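
The arithmetic behind the quoted maximum can be checked with a short sketch. It assumes N is a power of two and reads the bracket in the bound as a ceiling, which is the reading that reproduces the value 14 quoted for N = 1024; the function names are hypothetical.

```python
def one_response_max_tests(n_terminals):
    """Upper bound 6 + 2*ceil(log2(log2 N)) for the one-response fault.

    Assumption: N is a power of two, so log2(N) is an exact integer m and
    ceil(log2(m)) can be computed with integer arithmetic as (m - 1).bit_length().
    """
    if n_terminals < 2 or n_terminals & (n_terminals - 1):
        raise ValueError("N is assumed to be a power of two, N >= 2")
    m = n_terminals.bit_length() - 1       # m = log2(N)
    ceil_log_log = (m - 1).bit_length()    # ceil(log2(m)) for m >= 1
    return 6 + 2 * ceil_log_log


def single_fault_max_tests(n_terminals):
    """Maximum over all single-fault cases: max(12, 6 + 2*ceil(log2(log2 N)))."""
    return max(12, one_response_max_tests(n_terminals))


if __name__ == "__main__":
    for size in (16, 256, 1024):
        print(size, single_fault_max_tests(size))
```

For N = 1024, log2 N = 10 and the ceiling of log2 10 is 4, so the bound is 6 + 8 = 14; for small networks the constant 12 of Subcases D and E dominates.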
Fig. 2 Fault-free response of a network to the test set. (a) Phase 1 test. (b) Phase 2 test.

Table 2 Faults, Test Inputs and Outputs

Table 3 Faults, Test Inputs and Outputs

Table 4 Faulty Output Pattern in Case 2

Table 6 Characteristics of Single Faults


No. of Addi- 
‘tional Tests For 
Determining 

the Fault Type 


No. of Tests 
Needed For 
Determining 
the Fault 
Location 


Cases Fault Types 


EE PARC CPE G42 |10g5 log5N) | : 
Case 1: (s s_.) or 
One-Response 10’° 12 4+2[ log, (log.N) | 
Fault 
(Sp 585) > (S585) » (811255) 4+2 |Log, (1og,N) | 
(S555 2)3 (S44552)5(S 455,95 | or 2 
147757? 8"1977 472 8719774 442 Log, (ogyN)1 
| (S19°87) > (81995) 3) 
Subcase S12? 49)? 819753) grea) 0 
“ A (S,55,) 
@ 
fr ($1 925813) > (S31 9257)> (S455) 3)> 
2 (S3 555), (S14 >8j5)> (S153); 
a. Subcase ($1 4°54 9) (81,254) »(Sj9°8,). 2 
e B & C (S,5°5,)>(S,.8,).(S455)); 
9 (8, 58,.,)5 (8,58) 5 (8,58, 5) 
z De Dae eee eee 
(S353) 
3 (S555. 0) (Ons Do) 9 (S595 45) 
a pa ae eis 
w. 
8 (8g57) + (8455) + (841 95))5 0, 2, or 4 
SD 


($4 4,25,) 9 (S35 ,95,)+ (8,4 95)5)> 
S.),(S Ss.) 


Subcase 
D&E 


Oi? 14°°13)+ S14) 


2 
o | (cannot be located |(not distin- 
pe 9 ee ccotih 
& ee ae (S598) (S584) >(Sg58,)> - at the single guishable 

(S,,S.) switching element from a link 
87° 1 
| stuck fault) 
os 3: {85 »8+5,55, 25655728915, 458,55 | 

Sepavaced x {s.}3 {$19} x 18 4 0 or 2 

Two- Response {8p 985986 08g989 08442544284, 25845) 

Fault 


} 


Case 4: (S25. 3S\0S.55.49.55 45. 225 

Multiple- ee . ce ,! - 13° 15 189 4 0, Zi or 4 
- Response ae 10 

nner {89285586 >Sg+8g 58492841251 4>515) 
Abstract


The advent of microprocessors as low cost general purpose computing elements has made feasible the implementation of multiprocessor systems containing a large number of cooperating modules. In these systems an early diagnosis of faulty modules will become increasingly important. Previous models of system level fault diagnosis have studied this problem considering that the normal operation of the system is interrupted and the diagnostic phase is started in the whole system.

When the number of modules in the system is large it is unlikely that all of them are busy with computation at all times. Therefore it may be possible to utilize this "slack" by having the non-busy modules perform diagnostics. In this way computation and diagnosis are performed concurrently in real time. In this paper we consider the problem of designing systems in which concurrent computation/diagnosis is possible and we evaluate the potential of such an approach.

A generalization of previous models of self-diagnosis to take into account the concept of busy modules is introduced. For systems whose interconnection structure is defined by a special class of regular graphs, the relation between the degree of parallelism P, the maximal number of allowed faults t, the outdegree of each module in the system L and the number of modules n is derived such that, without interrupting any busy module, it is possible to identify the status of all the non busy modules (either faulty or non-faulty) simultaneously in 1-step. Optimal strategies for scheduling the assignment of modules to be busy (i.e. assignment of tasks to modules assuming homogeneity) among the n modules of the system are given. Finally the characteristics of optimal connections for concurrent computation/diagnosis are derived.


Introduction


The advent of microprocessors as low cost general purpose computing elements has made feasible the implementation of homogeneous multiprocessor systems containing a large number of cooperating and communicating modules. Such systems have many potential performance improvements including increased availability via fault diagnosis and reconfiguration, leading to a potential graceful degradation should failures occur. To this end system level diagnosis of faulty modules will become increasingly important.

Previous models of system level diagnosis [1], [2], [3] have studied this problem considering that the normal operation of the system is completely interrupted when the diagnostic phase, which occurs periodically, is started in the system. When the number of modules in the system is large it is unlikely that all of them will be busy performing computations at all times. Therefore it may be possible to utilize this "slack" by having the non-busy modules perform diagnostics concurrently with computation by the busy modules. In this way computation and diagnosis are performed concurrently in real time.

The problem of concurrent computation and diagnosis has been introduced in [4] and has been studied for the case in which, without interrupting any of the modules which perform computation, it is possible to identify the status of at least one non busy module, therefore being able to start a sequential diagnosis in the system, in which the faulty modules are identified by several repetitions of testing and reconfiguration or repair.

In this paper we consider the problem of designing systems in which concurrent computation and diagnosis is possible and it is required that the status (faulty or fault-free) of all the modules not involved in computation can be identified by a single application of tests between these non busy modules. This is called 1-step diagnosis.

In Section 2 we introduce a generalization of previous models of system level diagnosis to take into account the concept of busy modules. In Section 3, for a special class of systems whose interconnection structures correspond to a type of regular graph (which are relevant from the point of view of easy diagnosis [5]), the relation among the degree of parallelism in the system, the maximal number of allowed faults, the outdegree of each module in the system and the number of modules is derived for 1-step concurrent diagnosis. Optimal strategies for the assignment of computation tasks to modules are given. Finally in Section 4 we show that optimal concurrent computation/diagnosis is strongly dependent on the interconnection structure of the system and we derive the characteristics of optimal connections for concurrent computation/diagnosis.

Model for concurrent computation/diagnosis


Following the Preparata Metze Chien model of system level diagnosis [1], we represent a self-diagnosing system as a graph G(V,E) in which the set of nodes V, |V| = n, corresponds to a set of homogeneous processor elements and the set E of arcs, |E| = m, represents the set of interconnections used for testing among the modules of the system. The test outcome is assumed to be binary: a "0" outcome of $v_i$ testing $v_j$ means that $v_i$ has judged $v_j$ as fault-free, and a "1" outcome means that $v_i$ has judged $v_j$ as faulty. Each module in G(V,E) may be in one of two states: "busy" when that module has been assigned a computational task, or "non busy" when that module is not performing a computational task and can be assigned to perform diagnostic tasks in the system. This is represented in the diagnostic graph G(V,E) by the removal of all the busy modules and the incoming and outgoing arcs from these modules. The subgraph S(V',E'), $V' \subseteq V$, $E' \subseteq E$, which is obtained represents the part of the system in which diagnosis is performed.

The problems which arise by introducing this generalized model are the following:

- The subgraph S(V',E') must have certain properties required for diagnosis in order that all failures may be accurately diagnosed concurrently with the performance of computational tasks of the busy modules.

- We must consider the complexity of the scheduling algorithm, that is, that part which assigns computation or diagnostic tasks to modules. Two types of scheduling system are considered: a scheduling system which assigns the modules to be busy with a random strategy, and an intelligent scheduling system which assigns the modules to be busy in an optimal way. The two types of scheduling systems allow the determination of the greatest lower bound and of the greatest upper bound on the degree of parallelism (percentage of modules involved in computational tasks) which is possible in a system with n modules and consistent with the requirements of concurrent diagnosis.

- Finally we have to derive optimal systems, that is, systems in which we can maximize the number of busy modules (maximize the degree of parallelism in the system) while maintaining the maximal diagnostic capabilities in the rest of the system.


Concurrent computation/diagnosis in $D_{1,L}$ systems


In this section we consider the problems previously introduced for a special class of homogeneous multiprocessor systems. These systems have an interconnection structure in which each module is connected to L other modules. If the modules are labelled 0, 1, ..., n-1, the interconnections from module i go to i+1, i+2, ..., i+L (where these module numbers are modulo n). Such systems, called $D_{1,L}$ systems, have been shown to have useful diagnostic properties [5]. Figure 1 shows a $D_{1,3}$ system with n=8 and L=3.


Fig. 1 


Concurrent 1-step diagnosis in $D_{1,L}$ systems


For these systems, we consider the case in which it is desired that, after the removal of the B busy modules, the subgraph S(V',E'), |V'| = n-B, is 1-step diagnosable (i.e. all faulty modules can be identified simultaneously by a single test application assuming the number of faulty modules is bounded by a constant t).

Definition: A system S is concurrently 1-step diagnosable if, after the removal of the B busy modules, it is possible in 1-step to identify the status of all the n-B remaining modules, which take part in the diagnostic phase, without interrupting any of the busy modules.

If the scheduling system adopts a random strategy in assigning the busy modules (i.e. assigning computational tasks to modules), Theorem 1 specifies the upper bound on the value t (the maximum number of faults which can always be identified in 1-step) in the subgraph S(V',E').

Theorem 1: In $D_{1,L}$ systems it is possible to arbitrarily assign B busy modules, $1 \le B \le L-1$, so that the subgraph which remains after the removal of the B modules and associated arcs is at least $t = \min\{L-B,\ \lfloor (n-B-1)/2 \rfloor\}$ fault diagnosable in 1-step.

Proof: Removing B nodes and the associated arcs, each of the remaining nodes is tested by at least L-B other nodes. Relabelling the n-B remaining nodes from 0 to n-B-1 in increasing order, the subgraph S(V',E') corresponds to a $D_{1,L-B}$ system. Since $D_{1,L}$ systems have been shown to be L fault diagnosable in 1-step if $n \ge 2L+1$ [1,6], it follows that if $L-B \le \lfloor (n-B-1)/2 \rfloor$ the subgraph is t = L-B fault diagnosable in 1-step, and if $L-B > \lfloor (n-B-1)/2 \rfloor$ the system is $t = \lfloor (n-B-1)/2 \rfloor$ fault diagnosable in 1-step.


The following example shows that in $D_{1,L}$ systems, if $B \ge L$ the diagnostic subgraph obtained may not be diagnosable.

Example: For n=8 and L=3, a $D_{1,3}$ system is shown in Fig. 2a. If B=3 and modules 0,1,2 are assigned as busy, then S(V',E') is as shown in Fig. 2b.


Fig. 2


For this graph, node 3 is not tested by any other nodes and hence a fault in this node cannot be diagnosed. Consequently the original system is not concurrently 1-step diagnosable for B=3.
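
The example can be reproduced mechanically. The following sketch, with hypothetical helper names, removes a chosen set of busy modules from a $D_{1,L}$ system, reports the non busy modules that no other non busy module tests, and evaluates the bound of Theorem 1.

```python
def diagnostic_subgraph(n, L, busy):
    """Remove the busy modules and all their arcs from a D_{1,L} system."""
    busy = set(busy)
    return {
        i: [(i + d) % n for d in range(1, L + 1) if (i + d) % n not in busy]
        for i in range(n)
        if i not in busy
    }


def untested_nodes(subgraph):
    """Non busy modules that receive no test from another non busy module."""
    tested = {j for targets in subgraph.values() for j in targets}
    return sorted(set(subgraph) - tested)


def theorem1_bound(n, L, B):
    """t = min(L - B, floor((n - B - 1) / 2)) from Theorem 1 (1 <= B <= L - 1)."""
    return min(L - B, (n - B - 1) // 2)


if __name__ == "__main__":
    print(untested_nodes(diagnostic_subgraph(8, 3, [0, 1, 2])))  # B = L = 3
    print(untested_nodes(diagnostic_subgraph(8, 3, [0, 1])))     # B = 2 < L
    print(theorem1_bound(8, 3, 2))
```

With modules 0, 1, 2 busy, node 3 loses all three of its testers, which is the situation of Fig. 2b; with only two busy modules every remaining node keeps at least one tester.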


We can derive a greatest lower bound on the degree of parallelism for $D_{1,L}$ systems assuming arbitrary assignment of busy modules, where the degree of parallelism P is the percentage of (busy) modules which are assigned to computational tasks. The measure of P is given by $P = \frac{B}{n}\,100\%$. The graph in Fig. 3 shows the relation between P and L for a given n and parameter t, which permits concurrent diagnosis, in the worst case of assignment of busy modules.

If an "intelligent" assignment of computational tasks to modules is employed the degree of parallelism P consistent with concurrent diagnosis can be greatly increased.


Fig. 3


Theorem 2: In a $D_{1,L}$ system, if $n \ge 2L+1$ and k is a positive number such that $kL \le n < (k+1)L$, for any t, $1 \le t \le L$, it is possible to assign $B = k(L-t)+\alpha$ busy modules (where $\alpha = 0$ if $n-kL-t \le 0$, otherwise $\alpha = n-kL-t$), in such a way that the diagnostic subgraph is 1-step t-fault diagnosable.

Proof: The proof is given through a constructive procedure for assigning B busy modules. If the nodes are labelled 0 through n-1, assign nodes 0 through t-1 as non busy and the nodes t, ..., L-1 as busy; repeat this basic process k times. This leaves a total of n-kL unassigned nodes where $0 \le n-kL < L$. If $n-kL \le t$ the remaining n-kL nodes are assigned as non busy; otherwise, if $n-kL > t$, nodes kL through kL+t-1 are assigned as non busy and nodes kL+t through n-1 are assigned as busy. With this procedure the n-B non busy nodes may be relabelled from 0 to n-B-1 in increasing order. The resulting diagnostic subgraph S(V',E') is a $D_{1,t}$ system which from [1, 6] is 1-step t-fault diagnosable.
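
The constructive procedure in the proof can be sketched as code. The function names are hypothetical, and the check performed is only the property the proof relies on (every non busy node keeps at least t non busy testers), not a general diagnosability test. The value L=4 used in the demonstration is inferred from n=15, k=3, α=1, t=2 and is an assumption of this sketch.

```python
def theorem2_assignment(n, L, t):
    """Busy modules chosen by the procedure of Theorem 2 (requires n >= 2L+1)."""
    if not (n >= 2 * L + 1 and 1 <= t <= L):
        raise ValueError("expected n >= 2L+1 and 1 <= t <= L")
    k = n // L                      # kL <= n < (k+1)L
    busy = []
    for block in range(k):          # t non busy nodes, then L-t busy nodes
        base = block * L
        busy.extend(range(base + t, base + L))
    if n - k * L > t:               # leftover nodes beyond the first t are busy
        busy.extend(range(k * L + t, n))
    return busy


def min_nonbusy_testers(n, L, busy):
    """Smallest number of non busy testers over all non busy nodes."""
    busy = set(busy)
    free = [i for i in range(n) if i not in busy]
    return min(sum((j - d) % n not in busy for d in range(1, L + 1)) for j in free)


if __name__ == "__main__":
    busy = theorem2_assignment(15, 4, 2)
    print(busy, len(busy), min_nonbusy_testers(15, 4, busy))
```

The number of busy modules returned equals $k(L-t)+\alpha$, and in the demonstration each of the 8 non busy nodes is still tested by at least t = 2 non busy nodes.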


Theorem 3: In $D_{1,L}$ systems, if $L+1 \le n \le 2L$, it is possible to assign B = n-2t-1 busy modules in such a way that the system is concurrently 1-step t-fault diagnosable.

Proof: This proof is also given through a constructive procedure which defines the scheduling algorithm of computational tasks to modules.

We assume that the nodes are labelled 0 through n-1 and determine B and t.

If $L-t \ge B$, assign nodes 0 through B-1 as busy and assign the remaining n-B nodes as non busy. If $L-t < B$, assign nodes 0 through L-t-1 as busy, assign nodes L-t through L-1 as non busy, assign nodes L through n-t-2 as busy and nodes n-t-1 through n-1 as non busy.

In both cases, relabelling the n-B remaining nodes from 0 to n-B-1 in increasing order, the resulting subgraph S(V',E') is a $D_{1,t}$ design which from [1], [6] is 1-step t-fault diagnosable.
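
The two branches of this procedure can be sketched in the same style. The names are hypothetical, and the demonstration uses n=10, t=2 with L=5; the value of L is an assumption of the sketch, chosen only so that $L+1 \le n \le 2L$ and $L-t < B$ hold.

```python
def theorem3_assignment(n, L, t):
    """Busy modules chosen by the procedure of Theorem 3 (L+1 <= n <= 2L)."""
    if not L + 1 <= n <= 2 * L:
        raise ValueError("expected L+1 <= n <= 2L")
    B = n - 2 * t - 1
    if t < 1 or B < 0:
        raise ValueError("expected t >= 1 and n >= 2t+1")
    if L - t >= B:
        return list(range(B))                       # one chain of B busy nodes
    # busy: 0..L-t-1 and L..n-t-2; non busy: L-t..L-1 and n-t-1..n-1
    return list(range(L - t)) + list(range(L, n - t - 1))


def min_nonbusy_testers(n, L, busy):
    """Smallest number of non busy testers over all non busy nodes."""
    busy = set(busy)
    free = [i for i in range(n) if i not in busy]
    return min(sum((j - d) % n not in busy for d in range(1, L + 1)) for j in free)


if __name__ == "__main__":
    busy = theorem3_assignment(10, 5, 2)
    print(busy, min_nonbusy_testers(10, 5, busy))
```

In both branches exactly n-2t-1 modules are busy and 2t+1 remain for diagnosis, which is the smallest number of modules that can be 1-step t-fault diagnosable.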


Example: Consider the $D_{1,L}$ system with n=15 and assume t=2, k=3 and $\alpha$=1. The diagnostic subgraph obtained by the application of the procedure in Theorem 2 is shown in Fig. 4.


Fig. 4


As a second case consider the $D_{1,L}$ system with n=10 and assume B=5 and t=2 (L-t<B). From the procedure of Theorem 3, the subgraph which is obtained is shown in Fig. 5.


Fig. 5


The degree of parallelism P as a function of L for n=10 and several values of t for this intelligent scheduling algorithm is shown in Fig. 6.


Fig. 6


The graphs in Fig. 3 and Fig. 6 depend strongly on n. If we consider n=100, L=2 and t=1 the degree of parallelism which is obtainable with any possible assignment of busy modules drops to 1% while with an intelligent assignment it remains at about 50%.
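
The two percentages quoted for n=100, L=2, t=1 follow from Theorems 1 and 2. The sketch below uses hypothetical function names; the worst-case count B = min(L-t, n-2t-1) is obtained here by solving the bound of Theorem 1 for B, which is a derivation made for this illustration rather than a formula stated in the text.

```python
def arbitrary_assignment_parallelism(n, L, t):
    """P (in percent) guaranteed for any choice of busy modules (Theorem 1).

    Theorem 1 gives t = min(L - B, floor((n - B - 1) / 2)); requiring this to
    be at least t yields B <= L - t and B <= n - 2t - 1.
    """
    busy = max(0, min(L - t, n - 2 * t - 1))
    return 100.0 * busy / n


def theorem2_parallelism(n, L, t):
    """P (in percent) reached by the assignment of Theorem 2 (n >= 2L+1)."""
    if n < 2 * L + 1:
        raise ValueError("Theorem 2 is stated for n >= 2L+1")
    k = n // L
    alpha = max(0, n - k * L - t)
    return 100.0 * (k * (L - t) + alpha) / n


if __name__ == "__main__":
    print(arbitrary_assignment_parallelism(100, 2, 1))
    print(theorem2_parallelism(100, 2, 1))
```

With arbitrary assignment only one of the hundred modules can be guaranteed busy, while the alternating pattern of Theorem 2 keeps every second module busy.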


The scheduling algorithms described in Theorems 2 and 3 can be shown to be optimal for $D_{1,L}$ systems and therefore the graph shown in Fig. 6 represents the greatest upper bound on the degree of parallelism which can be obtained in $D_{1,L}$ systems with concurrent 1-step diagnosis.


Optimal connections for concurrent computation/diagnosis


In the previous section we have examined the characteristics of $D_{1,L}$ systems which enable concurrent computation and diagnosis. In this section we consider the problem of determining which interconnection structures are optimal for concurrent computation/diagnosis.

Definition: An interconnection structure is optimal for concurrent computation/diagnosis if it is possible to assign B busy modules, for any $B \le n-3$, in such a way that the system remains 1-step t-fault diagnosable with $t = \lfloor (n-B-1)/2 \rfloor$.

This definition of optimality corresponds to the statement that it is possible to assign the maximal number of busy units so that the diagnostic subgraph S is maximally 1-step diagnosable.


Theorem 4: For $L+1 \le n \le 2L+1$, $D_{1,L}$ systems define optimal interconnection structures for concurrent 1-step diagnosis.

Proof: From Theorem 3, for $L+1 \le n \le 2L$, it is possible to choose B = n-2t-1 busy modules and the diagnostic subgraph obtained by removing the B busy modules is 1-step $t = \lfloor (n-B-1)/2 \rfloor$-fault diagnosable. For n = 2L+1, from Theorem 2 we have k=2, $\alpha$=0 and B = 2(L-t) and therefore we can assign B busy modules so that the remaining diagnostic subgraph is 1-step $t = \lfloor (n-B-1)/2 \rfloor$ fault diagnosable.

$D_{1,L}$ interconnection structures for n = 2L+1 are of particular interest for concurrent 1-step diagnosis. Let us consider a $D_{1,4}$ system for n=9.

The complete system (i.e. B=0) is 1-step 4-fault diagnosable. If we apply the procedure in Theorem 2 we get for t=1,2,3 the diagnostic subgraphs shown in Fig. 7a, 7b, 7c, respectively, which are respectively $D_{1,1}$, $D_{1,2}$ and $D_{1,3}$ systems.


Fig. 7


In general for n > 2L+1, $D_{1,L}$ systems are not optimal in the sense that it is not always possible to assign the maximal number of busy modules so that the remaining subgraph is maximally 1-step fault diagnosable.


Example: Consider a system with n=17 and L=3. For the $D_{1,3}$ interconnection structure the node i is connected to i+1, i+2 and i+3 (mod n) for i=0,...,n-1. The application of the procedure in Theorem 2 for t=1 results in the diagnostic subgraph shown in Fig. 8a.

In this case the degree of parallelism is only 64.7% (B/n = 11/17) and the subgraph of Fig. 8a is only 1-step 1-fault diagnosable. No other assignment of more than 11 busy modules for the $D_{1,3}$ system enables concurrent 1-step diagnosis on the remaining diagnostic subgraph.

Now consider another system with L=3 whose interconnection is defined by $i \to i+1,\ i+5,\ i+8$ (mod n), i=0,...,16. In this case 14 busy modules can be assigned as shown in Fig. 8b. The degree of parallelism becomes 82.3% and the subgraph in Fig. 8b is 1-step 1-fault diagnosable.


Fig. 8


The example shows that for n > 2L+1 $D_{1,L}$ systems are not necessarily optimal with respect to concurrent computation/diagnosis.

The following theorem determines an optimal interconnection structure for concurrent 1-step diagnosis.


Theorem 5: The systems defined by the following interconnection structure are optimal for concurrent 1-step t-fault diagnosability: there are up to 2t interconnections from module i defined by $i \to i+a \pmod n$ and $i \to i+\lceil n/2 \rceil - a \pmod n$, for $0 \le i \le n-1$, $a = 1, 2, \ldots, t$. Nodes 0 through t are assigned as non busy, together with the nodes $\lceil n/2 \rceil$ through $\lfloor n/2 \rfloor + t$. All the other nodes are assigned as busy.

Proof: Let us refer to Fig. 9.


Fig. 9


If n is odd 2t+1 nodes are assigned as non busy. If n is even 2t+2 nodes are assigned as non busy. We show that each non busy node is tested by t other non busy nodes. We show this for n odd; the case of n even is quite similar.

For n odd, the nodes $0 \le i \le t-1$ are tested by the i non busy nodes 0 through i-1. They are also tested by the nodes $i+\frac{n+1}{2}$ through $\frac{n-1}{2}+t$. These nodes are non busy since $i+\frac{n+1}{2} \ge \frac{n+1}{2}$ $(i \ge 0)$. The total number of non busy nodes which test nodes $0 \le i \le t-1$ is $i + \left(\frac{n-1}{2}+t\right) - \left(i+\frac{n+1}{2}\right) + 1 = t$. The node i = t is tested by the non busy nodes 0 through t-1.

The nodes $\frac{n+1}{2} \le i \le \frac{n-1}{2}+t$ are tested by the non busy nodes $\frac{n+1}{2}$ through i-1, and by the nodes $i-\frac{n+1}{2}+1$ through t. These last nodes are non busy since $i-\frac{n+1}{2}+1 \ge 1$ $(i \ge \frac{n+1}{2})$. The total number of non busy nodes which test the nodes $\frac{n+1}{2} \le i \le \frac{n-1}{2}+t$ is $\left(i-\frac{n+1}{2}\right) + \left(t - i + \frac{n+1}{2}\right) = t$. Therefore each non busy node is tested by t other non busy nodes. Relabelling the 2t+1 remaining nodes of the diagnostic subgraph S(V',E') from 0 to 2t in increasing order results in a $D_{1,t}$ system which from [1] and [6] is 1-step t-fault diagnosable.

The proof for n even is similar.

In both cases these connections are optimal with respect to the degree of parallelism.
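
The structure of Theorem 5 can be exercised on the two examples given in the text (n=17 and n=14, both with t=3). The sketch below is built on an assumption: the second family of connections is taken as i + ceil(n/2) - a, a reading chosen because it reproduces the connection lists of both examples, and the function names are hypothetical.

```python
def theorem5_structure(n, t):
    """Return (offsets, non_busy_nodes) for the structure of Theorem 5.

    Assumption: offsets are a and ceil(n/2) - a for a = 1..t; non busy nodes
    are 0..t together with ceil(n/2)..floor(n/2)+t.
    """
    if t < 1 or 2 * t + 2 > n:
        raise ValueError("expected t >= 1 and n >= 2t+2")
    half_up = (n + 1) // 2                      # ceil(n/2)
    offsets = sorted(set(range(1, t + 1)) | {half_up - a for a in range(1, t + 1)})
    free = list(range(t + 1)) + list(range(half_up, n // 2 + t + 1))
    return offsets, free


def min_testers(n, offsets, free):
    """Smallest number of non busy testers over all non busy nodes."""
    free_set = set(free)
    return min(sum((j - d) % n in free_set for d in offsets) for j in free)


def parallelism(n, free):
    """Degree of parallelism in percent: busy modules over all modules."""
    return 100.0 * (n - len(free)) / n


if __name__ == "__main__":
    for n in (17, 14):
        offsets, free = theorem5_structure(n, 3)
        print(n, offsets, free, min_testers(n, offsets, free), round(parallelism(n, free), 1))
```

For n=17 the sketch leaves 7 non busy nodes and for n=14 it leaves 8, matching the sizes of the diagnostic subgraphs in the two examples, and in both cases every non busy node is tested by 3 other non busy nodes.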


Example: Consider the system with n=17, L=6 and interconnection structure defined by the connections from node i, $i \to i+1,\ i+2,\ i+3,\ i+6,\ i+7,\ i+8$. For t=3 the diagnostic subgraph resulting from the assignment of Theorem 5 is shown in Fig. 10.


Fig. 10


This subgraph is a $D_{1,3}$ system with 7 nodes.

Similarly if n=14, L=6 and t=3, the interconnection structure is defined by $i \to i+1,\ i+2,\ i+3,\ i+4,\ i+5,\ i+6$, and the system is shown in Fig. 11.


Fig. 11


The diagnostic subgraph is a $D_{1,3}$ system with 8 nodes.

The interconnection structure described in Theorem 5 results in concurrent 1-step diagnostic systems whose degree of parallelism is given by $P = \frac{n-2t-1}{n}\,100\%$ for n odd ($P = \frac{n-2t-2}{n}\,100\%$ for n even). For n=100 the value of P as a function of t (for t = 1, 2, ..., 10) is shown in Fig. 12.


Fig. 12


Table 1 summarizes the values of parallelism obtainable with concurrent 1-step diagnosis for systems with $D_{1,L}$ interconnection structure and other optimal interconnection structures. The worst and the best assignment of busy module patterns are also summarized.


Table 1

$D_{1,L}$ connections ($i \to i+1,\ i+2,\ \ldots,\ i+L$)
  Random scheduling: $P = \frac{L-t}{n}\,100\%$. Worst busy module pattern: B consecutive modules.
  Optimal scheduling, $n \ge 2L+1$: $P = \frac{k(L-t)+\alpha}{n}\,100\%$, with $\alpha = 0$ if $n-kL-t \le 0$, $\alpha = n-kL-t$ otherwise, and $kL \le n < (k+1)L$. Best busy module pattern: alternating chains of t non busy modules and L-t busy modules.
  Optimal scheduling, $L+1 \le n \le 2L$: $P = \frac{n-2t-1}{n}\,100\%$. Best busy module pattern: see proof of Theorem 3.

Optimal connections (n > 2L+1), $i \to i+a$, $i \to i+\lceil n/2 \rceil - a$, $a = 1, 2, \ldots, t$
  n odd: $P = \frac{n-2t-1}{n}\,100\%$. n even: $P = \frac{n-2t-2}{n}\,100\%$.
  Best busy module pattern: with nodes labelled from 0, nodes $t+1, \ldots, \lceil n/2 \rceil - 1$ and $\lfloor n/2 \rfloor + t + 1, \ldots, n-1$ are busy.

Conclusions 


In this paper we have considered the pro-
blem of designing homogeneous multiprocessor
systems in which concurrent 1-step diagnosis
is possible. For systems whose interconnec-
tion structures are defined by regular graphs
(called $D_{1,L}$ systems), lower bounds and upper
bounds on the maximum number of modules involv-
ed in computation (degree of parallelism) have
been derived and strategies for the assignment
of computational tasks to modules have been
given. Optimal interconnection structures
which enable concurrent 1-step t-fault diagno-
sability have also been presented. Some more
work is needed to take into account the fact
that in general the required connections between
busy modules may be constrained by the algori-
thms to be executed in the system. A more
integrated study of the constraints and the
tradeoffs required by computation and diagno-
sis will be needed in order to exploit more
fully the potential performance improvement
of multiprocessor systems.
RELACS, A DATA BASE COMPUTER


Ellen Oliver* and P. Bruce Berra
Dept. of Industrial Engineering and Operations Research
Syracuse University
Syracuse, New York


Summary


RELACS, Relational Associative Computer System, is a
backend data base computer designed to support a rela-
tional data model. As shown in Figure 1, there are six
main functional blocks: the Global Control Unit (GCU);
the Data Dictionary Processor (DDP); the Associative
Query Translator (AQT); the Associative Units (AUs);
the Mass Storage Device (MSD); and the Output Buffer (OB).


Queries to the data base originate with the user and 
are preprocessed by the host before reaching RELACS. 
The preprocessor in the host computer parses the query, 
performs syntactical functions, and determines the at- 
tribute and/or relation search order such that the logi- 
cal relationship(s) of the query is maintained. When 
the sequence of query and control instructions required 
for execution is created, it is stored in an area of the 
host computer memory known as the Transfer Users Memory 
Area (TUMA) [1]. The TUMA may be implemented as a first- 
in-first-out queue of jobs or it may include the more 
sophisticated capability to rank jobs by predetermined 
priority criteria. In either case, once a job (i.e., 
query) is placed in the TUMA, the host will notify the 
GCU of RELACS. 
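
The two TUMA options described above (plain first-in-first-out, or ranking by a priority criterion) can be sketched with one small queue. This is only an illustration: the class name `Tuma`, its methods, and the convention that a smaller number means higher priority are assumptions, not part of the RELACS design.

```python
import heapq
from itertools import count


class Tuma:
    """Hypothetical model of the TUMA job queue (all names are illustrative)."""

    def __init__(self, ranked=False):
        self.ranked = ranked      # False: pure FIFO, True: rank by priority
        self._heap = []
        self._seq = count()       # arrival counter, keeps equal keys in FIFO order

    def put(self, job, priority=0):
        # In FIFO mode every job gets the same key, so arrival order decides.
        key = priority if self.ranked else 0
        heapq.heappush(self._heap, (key, next(self._seq), job))

    def get(self):
        # Smallest key first; ties are broken by arrival order.
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)


def drain(tuma):
    """Remove and return all jobs in the order the GCU would see them."""
    jobs = []
    while len(tuma):
        jobs.append(tuma.get())
    return jobs
```

With `ranked=False` the priorities passed to `put` are ignored, which models the simpler of the two implementations mentioned in the text.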


The first step in processing a query is to utilize 
the DDP [1] which performs data dictionary and data 
directory functions. The DDP is envisaged as a group 
of associative array memories interconnected by arrays 
of specially designed cells. When processing a query, 
the DDP will first establish whether or not the attri- 
bute and/or relation names exist in the data base. If 
they do, the DDP will verify the security and access
privilege levels of the user for the specific attributes 
and/or relations required. For the case where multiple 
relations are required, the DDP will establish whether 
or not a join can be constructed between the relations. 
Assuming that the user has the correct passwords, the 
attributes and/or relations exist and the logical re- 
lationship(s) can be constructed, descriptor data for 
all attributes referenced directly or indirectly in the 
query are output to the AQT. 
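
The order of the DDP checks (existence, access privilege, join feasibility) can be written out as a short sketch. The data layout is hypothetical: `catalog` maps a relation name to its set of attribute names, `grants` maps a user to the relations that user may access, and two relations are assumed joinable when they share an attribute name. None of these structures is specified by the text.

```python
def ddp_check(query_relations, user, catalog, grants):
    """Run the three DDP checks in the order given in the text.

    catalog: dict, relation name -> set of attribute names (assumed layout)
    grants:  dict, user -> set of relation names the user may access (assumed)
    Returns "ok" or a short reason for rejecting the query.
    """
    # 1. Do the named relations exist in the data base?
    for rel in query_relations:
        if rel not in catalog:
            return "unknown relation: " + rel
    # 2. Does the user hold the access privilege for each of them?
    for rel in query_relations:
        if rel not in grants.get(user, set()):
            return "access denied: " + rel
    # 3. Can a join be constructed between consecutive relations?
    #    (Assumption: a shared attribute name is enough.)
    for left, right in zip(query_relations, query_relations[1:]):
        if not (catalog[left] & catalog[right]):
            return "no join between %s and %s" % (left, right)
    return "ok"
```

Only when all three checks pass would descriptor data be forwarded to the AQT.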


The AQT performs the function of translating the ori- 
ginal sequence of instructions from the user and the 
descriptor data from the DDP into a set of instructions 
executable by the AUs. Implementation of the AQT would
include associative array memories which contain the
query and descriptor data required for translation, a
RAM which stores all relation descriptor data and a
storage stack or buffer which contains the AU instruc-
tions as they are created.


The AUs access the AQT to retrieve a set of instruc- 
tions which will enable the AU to access the MSD to re- 
trieve the required relation, to perform the search and/ 
or update specified by the query and to execute the 


* Currently at Bell Telephone Labs., Holmdel, New Jersey. 
necessary output. An associative array memory is the
heart of the AU which also includes a comparand array 
for multiple search arguments, a response array, an 
output buffer and I/O capabilities. The output from 
the AU will be either to the MSD (update), to the user 
via the OB and host computer, or to the other AU (join). 


The MSD is a large capacity (> 10° bits) storage de- 
vice with speed and bandwidth requirements compatible 
with the AU. In addition, the storage structure is a 
row-column matrix of cells which are addressed as blocks. 
Each block is equivalent to the associative array memory 
size. 


The OB is provided for those queries which require 
output from more than one relation. In general, the 
tuples from one relation will be fully searched by one 
of the AUs; therefore, when the tuples from more than 
one relation are searched, there will be output from 
more than one AU. The OB provides intermediate storage 
between the AUs and the host computer. In addition, it 
can provide the capability for collating tuples and/or 
formatting of data according to the user requirement. 
From here, the data are returned to the user via the 
host computer. 


The RELACS system was designed with the objective 
that it support a relational data model yet overcome 
the weaknesses of earlier systems, namely, the I/O 
bottleneck due to the relatively slow I/O with respect 
to faster search times, and the requirement that the 
entire data base be searched at least once for each 
query. The DDP provides the capability to precisely 
locate the relations which must be searched to process 
the query. I/O time is minimized by utilizing a custom
designed MSD which interfaces with the AU such that 
rapid, parallel data transfers take place. In addition, 
the AUs provide the capability to perform content based, 
parallel search operations and a facility to enable the 
join and/or projection operations. 
ABSTRACT -- Parallel algorithms are described for
recognizing parabolas and conics on bus automata. 
The parabola algorithm implements a conformal 
transformation by means of which recognition is 
similar to that of a straight line. It is 
capable of generalization to other curves. The 
conic recognition algorithm exploits geometric 
properties of conics, recognizing all kinds, but 
does not generalize as the properties are defini- 
tive for conics. The procedures and BA architec- 
ture are given in detail for conceptually 
important cases. 


I. INTRODUCTION 


In a paper given at the 1978 Conference on 
Parallel Processing [1] it was shown that bus
automata (BA), which are cellular automata
augmented by locally controllable intercellular
communication channels [2,3,4,5], could recognize
patterns by parallel algorithms. Both topological
and metric features could be recognized. Recog-
nition of straight lines [6] and their images
under conformal transformations was also discussed,
the first having been solved earlier while the
second depended on implementing a global coordi- 
nate transformation. Preliminary discussion was 
given in [1] of the special case of parabola 
recognition using this approach. This paper 
gives an explicit algorithm for doing it, 
including the all-important coordinate transfor- 
mation procedure, and the design of a BA which 
implements it in parallel. It uses both geometric 
properties of the parabola and a particular 
conformal mapping taking straight lines into 
parabolas. A second algorithm was devised which 
recognizes all conics immediately, using only 
geometric properties common to the conics. It is 
both more and less general than the preceding
algorithm, for parabolas are a special case; but 
the conformal map method is not restricted to 
parabolas, or even conics in general. 


Both algorithms are particular cases of 
powerful general methods. This is clearly the 
case for the conformal approach, but the second 
case is not so obvious. It exploits a particular 
geometric property. Geometric properties can be 
specified in many ways, one useful way being 
those invariant under some group of transforma- 
tions. Possession of such a property is often 
the definition of a class of geometric objects. 
The second algorithm uses a property of the set of 
diameters of a conic for immediate recognition. 


Both kinds of algorithms are included in the 
general theory of [1], but the subtle interplay 
of the metric and topological there noted needs 
many examples for its clarification. An impor- 
tant object of this paper, over and above the 
intrinsic interest of the algorithms themselves,
is to provide such examples. 


In the following sections, we refer to the 
first algorithm as the Conformal Parabola 
Recognition Algorithm, and the second as the 
Conic Recognition Algorithm. In Section II, we 
discuss the theory on which each algorithm is 
based. Sections III and IV present the algor- 
ithms and their BA architectural embodiments. 
Section V has some comments on the error 
involved in BA algorithms. 


II. THEORY


In this section, we develop the theory 
required in the design of the two algorithms. 
First, we consider Conformal Parabola Recognition. 
The conformal map is discussed as are certain 
properties of the parabola which allow us to 
locate the map's singularities. An overview of 
the algorithm is given, and the discrete approxi- 
mation used to represent the conformal parabolic 
net is considered. Next, we take the Conic 
Recognition Algorithm. Properties of diameters 
of conic sections are noted, and the flow of the 
pursuant algorithm is presented. 


1. CONFORMAL PARABOLA RECOGNITION 


Our notation follows [1] except that we 
interchange the roles of w and z to avoid the 
nuisance of dealing continually with the inverse 
of the function f(z) defined there. The conformal 
transformation 


    $w = z^{1/2}$    (1)


takes the cartesian coordinate system (u,v) where

    $w = u + iv$    (2)

into the confocal, coaxial, parabolic coordinate
system shown in Figure 5.2 of [1] in the z-plane,
with cartesian coordinates (x,y) given by

    $z = x + iy.$    (3)


The two families of parabolas all have the 

origin as focus and the x-axis as their symmetry 
axis. The two families are open to the -x 
direction in one case, to the +x direction in the 
other, and every member of one family is ortho- 
gonal to all members of the other [7, p. 564]. 


The parabola recognition strategy uses the 
following intuitive notion. Let the putative 
parabola be drawn in the z-plane. If we find its 
focus and axis of symmetry, we can then translate 
the origin to the focus and rotate the coordinate 
system so that the new x-axis lies along the 
axis of symmetry. In this new coordinate system 
the putative parabola is "parallel" to the
parabolas of one family; it either coincides with
one of them or lies between two of them, not
intersecting either. Its pre-image in the w-plane
is either a coordinate line or a straight line
lying between two adjacent parallel coordinate
lines which never meets them, i.e., it is parallel
to them. To within the resolving power of the
grid this is a straight line, for wiggles within
a cell are undetectable. Parabola recognition is
a transform of straight line recognition.


This motivates the following general 
approach. The z-plane is represented by a data- 
plane consisting of cells, one to a coordinate 
square. The putative parabola is stored in bits 
in the states of cells traversed by it. Control 
borders and planes are used, in addition, for 
help in carrying out the four main steps of the 
algorithm. These are: 


1. Locate the directrix of the parabola. 
2. From it, orient the cartesian coordinate 
system, and find the focus. 
3. Generate the adjacent grid parabolas. 
4. Recognize. 


The major problem is step 3, which is the 
embodiment of the conformal transformation (1), 
essentially. It is done immediately, with the 
present algorithm, corresponding to "propagating" 
the useful portion of the parabolic coordinate 
system out from the focus, which is a singular 
point of mapping (1) (branch point). Step 4 is 
trivial, after step 3 is done. 


Finding the directrix depends on the follow- 
ing property of the parabola. If two tangents to 
a parabola meet at right angles, their point of 
intersection lies on the directrix [8, p. 129]. 


Step 1 uses a pair of right angles (represented by
busses) each of which is translated until it
touches the parabola in two points. The straight
line through the corners is the directrix. Trans-
lating a line parallel to the directrix until it
touches the parabola locates the vertex of the
parabola. The vertex lies along the axis of
symmetry (which is perpendicular to the directrix)
half-way between directrix and focus. Step 2
thus finds focus and axis of symmetry from input
data and directrix. The rotated coordinate
system has the symmetry axis, and the line
through the focus perpendicular to that axis, as
its new x and y axes.
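
The property behind step 1 (perpendicular tangents meet on the directrix) can be checked with exact arithmetic. The sketch below is not the BA procedure: for convenience it assumes a frame with the vertex at the origin, so the parabola is y^2 = 4ax with directrix x = -a, and the function name and parametrization are illustrative.

```python
from fractions import Fraction


def perpendicular_tangent_corner(a, t):
    """Corner of the right angle formed by two perpendicular tangents.

    Assumed frame: parabola y**2 = 4*a*x (vertex at the origin), with points
    (a*t**2, 2*a*t). The tangent at parameter t is x - t*y + a*t**2 = 0 and
    has slope 1/t, so the perpendicular tangent has parameter -1/t.
    """
    t1 = Fraction(t)
    t2 = -1 / t1
    # Subtract the two tangent equations to eliminate x, then back-substitute.
    y = a * (t1 ** 2 - t2 ** 2) / (t1 - t2)
    x = t1 * y - a * t1 ** 2
    return x, y


# Every corner has x = -a, i.e. it lies on the directrix.
corners = [perpendicular_tangent_corner(3, t) for t in (1, 2, Fraction(1, 3), -5)]
```

Two such corners already fix the directrix, which is why step 1 needs only a pair of translatable right angles.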


The geometrical basis of step 3 is as follows.
The parabolic net in the z-plane is the image of
the integer net (u,v) in the w-plane under the
transformation

    $z = f(w) = w^2$    (4)

As we saw above, the coordinate lines u = const.
and v = const. go over to confocal coaxial para-
bolas with the x-axis their common symmetry axis.
We approximate a net parabola by a chain of
chords connecting successive lattice points on
the parabola (starting from the vertex and follow-
ing the ±y branches to infinity). To find the
slopes of these chords for the parabola deter-
mined by $v = v_0$ we compute the change in x and y
as u varies from $u_0$ to $u_0 + 1$. The coordinates
of its vertex are $(-v_0^2, 0)$, by (4) (here
$w = 0 + iv_0$). We have (where $w_0 = u_0 + iv_0$)

    $f(w_0 + 1) - f(w_0) = (2u_0 + 1) + i(2v_0)$    (5)

Along this parabola we obtain constant increments
in y ($2v_0$, or on the negative branch, $-2v_0$), and
successive odd integers in x. (If $u_0$ is fixed
and v varies, we get the other family, with y
increments $\pm 2u_0$, and the parabolas open up in the
opposite direction.) The slopes of consecutive
chords are

    $2v_0/1,\ 2v_0/3,\ 2v_0/5,\ \dots,\ 2v_0/(2n+1),\ \dots$    (6)

in the upper half-plane, the negatives of these
values in the lower half-plane.
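
The chord increments can be checked numerically. The sketch below assumes the map z = w^2, so that x = u^2 - v^2 and y = 2uv; the function names are illustrative and it computes on integers rather than on a cell array.

```python
def chord_vertices(v0, count):
    """Lattice points u = 0..count on the image of the line v = v0 under z = w**2.

    With w = u + i*v0, z = w**2 has x = u*u - v0*v0 and y = 2*u*v0.
    """
    return [(u * u - v0 * v0, 2 * u * v0) for u in range(count + 1)]


def increments(points):
    """Differences (dx, dy) between successive chord endpoints."""
    return [(x1 - x0, y1 - y0) for (x0, y0), (x1, y1) in zip(points, points[1:])]


pts = chord_vertices(2, 3)   # parabola with vertex (-4, 0)
steps = increments(pts)      # dx runs through the odd integers, dy stays at 2*v0
```

For v0 = 2 the chord endpoints are (-4, 0), (-3, 4), (0, 8), (5, 12): x advances by 1, 3, 5 while y advances by 4 each time, so the chord slopes are 4/1, 4/3, 4/5.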


Plotting the adjacent parabolas in step 3 
thus involves successive chord construction once 
the vertices have been located. Step 4 checks 
whether the putative parabola falls between the 
adjacent ones (or coincides with one of them) 
or not. 


2. GENERAL CONIC RECOGNITION


The Conic Recognition Algorithm uses pro-
perties of sets of parallel chords. The
algorithm also distinguishes between ellipses,
parabolae, and hyperbolae including degenerate
or limiting cases. The equation of a conic in
polar coordinates $(r,\theta)$ is given by

    $r = \ell/(1 - e \cos\theta)$    (7)

where e is the eccentricity, $\ell/e$ is the distance
from focus to directrix, the origin is at the
focus, and the ray $\theta = 0$ lies along the axis of
symmetry. The case e = 0 is a circle of radius
$\ell$ (the directrix is the "line at infinity"), the
range 0 < e < 1 corresponds to ellipses, e = 1
is a parabola and e > 1 gives the hyperbolae.
The circle is a degenerate ellipse, the parabola
is a limiting case of both ellipses and hyper-
bolae, two parallel straight lines are a
degenerate parabola, and two intersecting
straight lines are a degenerate hyperbola. The
conics have two foci, coincident in the case of
the circle, with the point at infinity one focus
of the parabola, and with the midpoint of the
line joining the foci called the center. Only
when the foci are separated by a finite distance
is the center not at infinity. In that case the
center is a center of symmetry and the conic is
called a central conic. Straight lines through
the center which intersect the conic are called
diameters. Two diameters $d_1$, $d_2$ are conjugate
if their slopes (in a coordinate system (x,y)
with origin at the center, x-axis along the
symmetry axis, y-axis parallel to the directrix)
$m_1$ and $m_2$ satisfy

    $m_1 m_2 = e^2 - 1$    (8)


An important theorem is the following. 


Theorem 1 The locus of midpoints of chords 
parallel to a diameter of a central 
conic is the conjugate diameter. 


For the parabola, this theorem becomes: 


Theorem 2 The locus of midpoints of a set of 
parallel chords of a parabola is a line 
parallel to the symmetry axis. 


The above theorems and (8) suggest the follow-
ing general algorithm. Choose slope $m_1$ arbi-
trarily, covering the finite grid with lines of
this slope. Find the midpoints of all segments
with endpoints in the locus to be recognized. If
any line is found to cut the locus more than
twice, here or later, the locus is rejected,
since no line cuts a conic in more than two
points. (If it cuts it once, say lines parallel
to the symmetry axis of a parabola, choose a
different value for $m_1$.) If this is not the case
and the midpoints lie on a straight line of slope,
say, $m_2$, then the grid is again covered, this time
with lines of slope $m_2$. Midpoints are determined
as before. If the new locus is a straight line
of slope $m_1$, the locus is a central conic. For a
parabola midpoints are at infinity as the lines of
slope $m_2$ cut the parabola at only one point.
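
The midpoint test can be tried on an exactly known ellipse. The sketch below is an analytic check, not the BA procedure: it assumes the ellipse x^2/a2 + y^2/b2 = 1 with a2 > b2 (focal axis along x, center at the origin), for which e^2 = 1 - b2/a2. The function names and the choice of chord offsets are illustrative.

```python
from fractions import Fraction


def chord_midpoint(a2, b2, m, c):
    """Midpoint of the chord cut from x^2/a2 + y^2/b2 = 1 by the line y = m*x + c."""
    m, c = Fraction(m), Fraction(c)
    # Substituting the line into the ellipse gives A*x^2 + B*x + C = 0.
    A = b2 + a2 * m * m
    B = 2 * a2 * m * c
    C = a2 * (c * c - b2)
    if B * B - 4 * A * C <= 0:
        raise ValueError("line does not cut the ellipse in two points")
    x = -B / (2 * A)              # mean of the two roots
    return x, m * x + c


def midpoint_locus_slope(a2, b2, m1, offsets):
    """Slope of the line through the midpoints of chords of slope m1."""
    pts = [chord_midpoint(a2, b2, m1, c) for c in offsets]
    (x0, y0), (x1, y1) = pts[0], pts[1]
    slope = (y1 - y0) / (x1 - x0)
    # Every further midpoint must lie on the same straight line.
    assert all(y - y0 == slope * (x - x0) for x, y in pts)
    return slope


m1 = Fraction(1, 2)
m2 = midpoint_locus_slope(25, 9, m1, [-3, -1, 0, 2, 3])
back = midpoint_locus_slope(25, 9, m2, [-2, -1, 0, 1, 2])
```

Here `m2` is the slope of the midpoint line for chords of slope `m1`, and covering the ellipse again with chords of slope `m2` returns `m1`, which is the two-pass test for a central conic described above.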


Complete justification of the algorithm re- 
quires the following. If a symmetric smooth 
curve (which can have two disconnected portions, 
i.e., the connection is by way of the point at 
infinity in this case) satisfies (a) lines through 
the center define diameters by their intersections 
with the curve, (b) when tangents to the curve 
are constructed at those points of intersection 
they are parallel, (c) the line through the center 
parallel to the tangents defines another diameter 
(said to be conjugate to the first one) and (d) 
the slopes of the diameter and its conjugate are 
related by eqn. (8), then the curve is a conic. 
The proof requires only elementary analytical
geometry and calculus, and is omitted. 


To distinguish the parabola from the others
we use at least two values for slope $m_1$. If the
lines determined by the corresponding chord mid-
points are parallel, then the conic is a parabola.
The degenerate parabola consisting of a pair of 
parallel lines occurs if and only if the two lines 
of chord midpoints coincide. In the non-parabolic 
case, the lines intersect at the center. The 
conic is a hyperbola if the center is in an open 
region of the plane and an ellipse or circle if it 
is in a closed region of the plane. In the open 
case, the degenerate hyperbola consisting of two 
intersecting straight lines occurs if and only if 
the center is on the locus being recognized. It 
is a circle if conjugate diameters are always 
equal (or always orthogonal). 


III. CONFORMAL PARABOLA RECOGNITION 


We next present the detailed architecture and 
then the mode of operation of a BA implementing 
the conformal recognition algorithm. A BA is an 
array of finite state automata (cells) linked by a 
locally controllable switching network permitting 


. 2. 
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communication channels (busses) to be established 
between widely separated cells [2]. The busses 
are assembled from links, each of which conducts 
or not in accordance with the state of the 
particular cell with which it is associated. 


1. ARCHITECTURE


The BA architecture is shown in Figure 1.


[Figure 1 labels: control borders; control planes; semaphore and output cells; right column]


FIGURE 1 - BA Architecture


Three data planes, all n x n arrays of cells, are 
used, each representing the z-plane. Planes 1 
and 2 execute the bulk of the algorithm, while 3 
stores input (and intermediate) information. Each 
data cell has 3 (directed) links to each of its 8 
nearest neighbors within its data plane, i.e., 
those cells sharing either a face or an edge with 
it. Cells in adjacent data planes have links to
cells with the same row and column coordinates. 
Four control planes border the edges of the data 
planes and have "right columns" perpendicular to 
(the) data planes at their corners (see Figure 1). 
Each of their control cells has links to its 
nearest neighbors within its control plane, as 
well as between planes at right column junctions.
The top three rows of the control planes form 

the control borders. The control border cells are 
the only control cells linked directly to data 
cells in the data plane they bound. Finally, 
certain cells of the bottom row of each control 
plane are designated as semaphore cells, with one 
of them designated as the output cell. 


In the following discussion, d, s, F, and $\ell$
will denote directrix, axis of symmetry, focus, 
and latus rectum (the line segment parallel to d 
at F with endpoints on the parabola), respectively. 


2. CONFORMAL PARABOLA RECOGNITION


The pattern to be recognized is initially 
encoded (loaded) into the states of the data 


cells. Other information effectively encoded into 
the initial state of a cell is: the type of cell 
(data, control, semaphore or output); and its 
general location, i.e., data plane 1, 2 or 3, 
control 1-4, whether the cell is on a control 
border, right column or other special site. To 
initiate the algorithm, an external signal is sent 
to the semaphore cell controlling Step 1. 


The synchronization of separate subtasks of 
the procedure is accomplished by signals sent 
between semaphores and the rest of the cells. 
Each semaphore oversees a particular subtask: 
starting it; awaiting completion; and passing 
control to the next or to the output cell (halt). 
Termination signals are sent to waiting semaphores 
by designated cells at the control borders of the 
data planes. When the final step is complete, the 
result, encoded in the termination signal of the 
last subtask, is displayed at the output cell. 


The restricted arrangement was chosen for 
convenience in describing and implementing the 
algorithm. Plane dimensions (n) are large enough 
to contain the entire parabolic piece (including 
vertex, focus and a piece of the directrix) con- 
stituting the curve to be recognized. If we had 
done everything in a single large plane, the dia-
grams and operation would have been complicated
by the necessity to construct geometrically 
complicated bus systems. 


2.1 LOCATE THE DIRECTRIX -- Step 1 finds d by 
using the fact that perpendicular tangents to a 
parabola meet at d. 


The curve, stored in data plane 3, is loaded 
into 1 and 2. We examine all candidate vertex 
positions corresponding to a pair of translatable 
right angles simultaneously by setting up, in the 
busses of data planes 1 and 2, two distinct ortho-
gonal coordinate grids. The slopes of their axes
are exactly the slopes of the rays of the right 
angles. They are chosen as follows: 0(0°) and 
infinity (90°) in data plane 1; and 1 (45°) and 
-1 (135°) in data plane 2. Plane 1 sets up its 
grid by connecting all horizontal and vertical 
links to form busses, 2 by connecting diagonal 
links. Each cell of both planes now has coordi- 
nate busses running through it (4 busses, + and 
- for each coordinate). 


Consider the horizontal tangent (Figure 2(a)); 
the other three are similar. To find it, the 
curve is projected along horizontal busses to the 
control borders. To do this, each data cell 
crossed by the locus sends a signal in both hori-
zontal directions along its busses. The "tangent"
busses are selected at the control border by those
cells which receive a projection signal while one
of their nearest control border neighbors does not.


False tangents are also identified in this process 
(Figure 2(b)). They correspond to the situation 
in which a neighboring data cell of one of the 2 
control endcells (at opposite ends of the tangent) 
is crossed by the locus. 
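
The border selection rule can be simulated sequentially on a small grid. This is an illustrative sketch only: the grid is a list of strings with '#' marking cells crossed by the locus, a horizontal bus is modelled as a row, and the false-tangent test (locus present in the data cell next to either end of the bus) is one reading of the description above. A real BA would do all rows at once.

```python
def tangent_rows(grid):
    """Rows selected as horizontal "tangent" busses.

    grid: list of equal-length strings, '#' = cell crossed by the locus.
    Returns (row, is_false_tangent) pairs.
    """
    # A border cell receives a projection signal if its row contains the locus.
    hit = ['#' in row for row in grid]
    selected = []
    for r, got_signal in enumerate(hit):
        if not got_signal:
            continue
        neighbours = [hit[k] for k in (r - 1, r + 1) if 0 <= k < len(hit)]
        if not all(neighbours):                 # some neighbour got no signal
            # Assumed false-tangent rule: locus touches either end of the bus.
            is_false = grid[r][0] == '#' or grid[r][-1] == '#'
            selected.append((r, is_false))
    return selected


def transpose(grid):
    """Swap rows and columns so the same routine handles vertical busses."""
    return ["".join(col) for col in zip(*grid)]


# A parabola opening to the right that runs off the right-hand edge.
grid = [
    "......",
    "...###",
    "..#...",
    ".#....",
    "..#...",
    "...###",
    "......",
]
horizontal = tangent_rows(grid)
vertical = tangent_rows(transpose(grid))
```

On this grid the horizontal pass yields two false tangents (rows 1 and 5, where the curve leaves the plane) and the vertical pass yields one true tangent at the vertex column, matching the special case with false tangents parallel to the symmetry axis.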


Data planes 1 and 2 concurrently determine
tangents using the above method along both their
coordinate directions. If the pattern is a para-
bola, two cases can occur. First, if two
orthogonal tangents in each plane are found, then
two cells on d can be determined (corner cells)
as follows. Both corner cells are mapped by
projections perpendicular to the data planes to
cells in the data plane 1. There we run an
immediate algorithm for plotting a line from two
distinct points on it (see Appendix I) to repre-
sent d in the states of the cells crossed by it.


[Figure 2 panels: (a) Plane 1: Coordinate Bus System with Hori-
zontal and Vertical Links; (b) Plane 2: Coordinate Bus System
with Diagonal Links. Labels: 1) Lines x = constant project here
(and opposite side); 2) Lines y = constant project here (and
opposite side); 3) False tangent]


FIGURE 2 - Tangent Determination


The second case is the special case in which
the rays of one of the right angles are parallel
to d and s, respectively. Five tangents, two
false and parallel, are found. In one plane, we
find a corner cell as before. In the other, we
find the two false tangents parallel to the sym-
metry axis, and a true tangent in the other direc-
tion at the vertex. Hence d is determined by its
slope and a point on it (the slope is that of the
true tangent).


2.2 STEP 2: ORIENT GRID AND FIND FOCUS -- In this 
step, we first rotate the coordinate system (in 
data plane 1) making d the new y-axis. For unique- 
ness choose that intersection of d with an edge of 
the data plane which is closest to a designated 
edge as the center of rotation, and define the 
positive direction along d as the one making its 
slope negative in the old coordinate system. We 
define the positive direction of the new x-axis 
(perpendicular to d) as that which translates us 
from d toward the parabola. The final coordinate 
system (x',y') will have the origin at F, x'-axis 
along s, and y'-axis parallel to d (along $\ell$). To
construct the (x',y') grid we use d as a "template",
i.e., d (which is represented by a connected set
of cells) is translated parallel to itself, the
distinguishable translates being representatives
of y' coordinate lines. The translation process
also generates representatives of x' coordinate
lines.


Translation takes one step. The (x',y') 
system will be set up in data plane 1, while plane 
3 stores the putative parabola and intermediate 
computations. The translational bus structure is 
constructed as follows. First, within the data 
plane, like diagonal links are connected "head to 
tail" to form a Cartesian grid rotated 45°. At 
the control border, each cell connects each diag- 
onal input link to the output link (into the data 
plane) of the opposite diagonal type. Recall 
that each cell was originally given 3 sets of 
input and output links. One set propagates 
template data, the other two will be reserved for 
the x' and y' coordinate busses. 


Since d is a straight line, each cell of d 
has neighbors with which it forms one of the 
three configurations shown in Figure 3a. These 
are: nearest neighbors (1) above and to the 
right (labelled A); (2) to the left and below (B); 
or, (3) above and below (0). Type (3) corresponds
to 0's of the straight line code [6], types (1)
and (2), in pairs, to the 1's. To form the y'
coordinate lines, cells of d signal along the 45°
busses as follows. Each cell signals its label.
Data cells receiving them connect links (to form
busses parallel to d) correspondingly. If 'A'
is received, links at the top are connected to
those to the right. If 'B', left links are
connected to the links at the bottom. If '0',
links are connected top to bottom. 


The x'-coordinate busses are set up by the 
same signals, after reflection by the control 
borders. This makes them come in on the 135° 
busses (Figure 3a). If an 'A' is so received, 
bottom links are connected to those to the right. 
If 'B', left links are connected to links at the 
top. If '0', links are connected left to right 
(Figure 3b). Putting all links together in this 
fashion covers the plane with x'-coordinate busses 
in the same way that y'-bus formation covered
the plane with representatives of the y'-coor-
dinates previously.


We now copy the "parabola" into data plane
1, and find F. We use the fact that the vertex
tangent t is halfway from d to $\ell$. The tangent
algorithm of step 1 is used; t is parallel to d
and thus a y' coordinate line.


We find $\ell$ as follows. If d, t and $\ell$ cross
a common border, $\ell$ is easily found by marking
off a distance equal to that from d to t (see
Figure 4 in which the data and control planes
are drawn in one plane). This requires only
diagonal and vertical busses as shown.




FIGURE 4 - Finding Focus F 


The following complication can arise. If d
hits the control border so that two horizontally
connected cells would lie along it (Figure 5a)
we must "short out" a column of the control plane
beneath one of them (Figure 5b). As this shorting
can be done at all control borders when the (x',
y') system is constructed, all such problems are
handled automatically.


[Figure 5 legend: coordinate bus; shorted cell. Panels (a) and (b)]


FIGURE 5 - Shorting


If t or $\ell$ hit the control border perpendicu-
lar to that crossed by d, the diagonal busses
must effectively "wrap around" the cube faces 
constituting two adjacent control planes. Figure 
6 shows all cases, including the one already 
discussed. Part (d) illustrates the fact that 
no special arrangements need be made for these 
cases either! 


FIGURE 6 - Bus Wrap Around


Finding F is now almost trivial. We merely 
bisect the segment of $\ell$ whose endpoints lie on
the putative parabola (Figure 4). The x'-line 
containing F is s and also the x'-axis. The 
latus rectum is marked on it by this process. 


2.3 STEP 3: PLOT THE GRID PARABOLAS -- We must 
plot adjacent parabolic coordinate lines, between 
which (or on one of which) the candidate para- 
bola must lie. We compute them in parallel on 
planes 1 and 2. The data stored in plane 3 is 
copied into both, and the (x',y') grid, d, etc., 
also appears in both. 

Let the 2 grid parabolas be images under
$z = f(w)$ of the lines $v = v_0$ and $v = v_0 + 1$ in the
w-plane. The coordinates of their vertices in
the (x',y')-plane are $(-v_0^2, 0)$ and $(-(v_0+1)^2, 0)$.
Hence, if $x_0$ is the distance from F to d, it
follows that

    $v_0^2 \le x_0 < (v_0+1)^2$    (9)

We use (9) and the fact that

    $1 + 3 + 5 + \dots + (2n-1) = n^2$  for n = 0, 1, 2, ...    (10)

to find the vertices.


All grid parabolas have their vertices to
the left of F along s at distances 1, 1+3, 1+3+5,
..., $n^2$, ... The laying off of the successive
segments 1, 3, 5, ..., (2n-1), ... can be done
immediately (see Appendix II). The vertices of 
the two parabolas sought are then found immed- 
iately by comparison with the candidate. It 
must be remembered that here distances are 
measured along (x',y')-coordinate lines, but 
these show up as the same distances along the 
control borders ((x,y) coordinates) provided the 
"shorting" described earlier has been carried 
out. This permits discussion of "oblique" 
cases with "rectangular" diagrams! 
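
The bracketing implied by (9) and (10) can be illustrated with a short sequential sketch. It assumes the quantity being bracketed is a nonnegative integer distance `x`, counted in cells from F along s, and that the goal is the integer `v0` with v0² ≤ x < (v0+1)². The function name and the loop are illustrative only; the BA lays off all the odd segments at once rather than iterating.

```python
def bracket_vertex(x):
    """Return (v0, left, right) with left = v0**2 <= x < (v0 + 1)**2 = right.

    Hypothetical helper: it accumulates the odd segments 1, 3, 5, ...
    so that the running total after n segments is n**2, as in (10).
    """
    if x < 0:
        raise ValueError("distance must be nonnegative")
    v0, total = 0, 0              # invariant: total == v0**2
    while total + (2 * v0 + 1) <= x:
        total += 2 * v0 + 1       # add the next odd segment
        v0 += 1
    return v0, total, total + 2 * v0 + 1
```

The two returned distances are where the vertices of the neighbouring grid parabolas would be marked along s.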


The next step is to find the points at
which successive chords meet on the parabolas
(see eqn. (6) for their slopes). Starting from
the vertex at $(-v_0^2, 0)$ we lay off constant incre-
ments $2v_0$ along the y' axis and increments 1, 3,
5, ..., along the x' axis. We get $2v_0$ as 1 more
than the length of the largest segment (immed-
iately to the right of the vertex), and mark
it off along the y' axis immediately. The
procedure is similar to the doubling procedure
discussed in connection with Figure 4. The x'
increments are found immediately as before
(Appendix II). The x' coordinate lines at each
y' increment "height" and the y' coordinate lines
at each x' increment intersect in a network of
cells including those sought. The cells in that
network are put in a special state by a signal
along the coordinate busses from control cells
on them. In that state they connect +x' directed
links of x' coordinate busses to +y' directed
links "above" s and to -y' directed links "below"
s. At a signal from the vertex the network
cells undergo a transition to a state indicating
they are on the parabola. It can be seen that
the signal from the vertex follows a pair of
paths which arrive only at the cells stated.
See Figure 7.


By using the line routine of Appendix I on
the rectangle bounded by path and successive
points we obtain the chord approximation of the
parabola with vertex at $(-v_0^2, 0)$.
In the other plane, the "parabola" with
vertex at $(-(v_0+1)^2, 0)$ is formed similarly.
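
As a rough check on the increments used in this step, the chord endpoints can be generated sequentially. The sketch below assumes the grid parabola is the image of the line v = v0 under z = w², so that integer u gives the point (u² - v0², 2·u·v0); the function name is hypothetical, and the BA itself marks these points in parallel rather than in a loop.

```python
def chord_points(v0, n_steps):
    """Points where successive chords meet on the grid parabola v = v0.

    Built incrementally from the vertex (-v0**2, 0): x' grows by
    1, 3, 5, ... and y' grows by the constant 2*v0 at each step.
    Returns the points above s and their mirror images below s.
    """
    x, y = -v0 * v0, 0
    upper = [(x, y)]
    for n in range(1, n_steps + 1):
        x += 2 * n - 1            # odd x' increment
        y += 2 * v0               # constant y' increment
        upper.append((x, y))
    lower = [(px, -py) for (px, py) in upper[1:]]
    return upper, lower
```

Every generated point satisfies y'² = 4·v0²·(x' + v0²), which is the confocal parabola with focus at the origin and vertex at (-v0², 0).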


FIGURE 7 - Plot Grid Parabolas


2.4 STEP 4: RECOGNIZE -- Recognition is simple.
The grid "parabolas" and the candidate are
loaded into one data plane. All data plane
links are then connected except those crossed
by either of the two grid parabolas. The candidate
parabola cells then emit a signal. If any cell
of the control border other than those lying
between the grid parabolas receives this signal,
the candidate is not a parabola; otherwise it is
accepted.
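
A sequential stand-in for this signal test is a flood fill. The sketch below is only an analogy under stated assumptions: the plane is an n x n grid whose outermost cells play the role of the control border, the grid "parabolas" are given as a set of barrier cells that the signal cannot enter (a cell-level simplification of cutting the crossed links), and all names are hypothetical. The BA performs the propagation immediately rather than cell by cell.

```python
from collections import deque

def accept_candidate(n, candidate, barriers, allowed_border):
    """Return True if the signal reaches only permitted border cells.

    n              -- side of the square plane, cells are (row, col)
    candidate      -- set of cells that emit the signal
    barriers       -- set of cells on the two grid "parabolas"
    allowed_border -- border cells lying between the grid parabolas
    """
    seen = set(candidate)
    queue = deque(candidate)
    while queue:
        r, c = queue.popleft()
        on_border = r in (0, n - 1) or c in (0, n - 1)
        if on_border and (r, c) not in allowed_border:
            return False          # signal escaped past a grid parabola
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = (r + dr, c + dc)
            if (0 <= nxt[0] < n and 0 <= nxt[1] < n
                    and nxt not in seen and nxt not in barriers):
                seen.add(nxt)
                queue.append(nxt)
    return True
```

With two straight barriers standing in for the grid curves, a candidate confined between them is accepted, while any gap in a barrier lets the signal reach a forbidden border cell and causes rejection.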


IV. CONIC RECOGNITION ALGORITHM 


The BA architecture for conic recognition is 
much simpler than Figure 1. It consists of an 
n x n data plane surrounded by a control border. 
Links are as before. The lower left corner cell 
serves as output and semaphore. Achieving 
immediacy of part of the algorithm needs a third 
dimension obtainable by making the square one 
face of a cube (or square prism). 


In IV.1 we present the key subroutine, which 
given any slope m finds the conjugate slope and 
corresponding diameter(s). The whole algorithm 
is sketched in IV.2. 


1. CONJUGATE SUBROUTINE 


Tagging two data cells specifies the slope m 
of the line determined by them (Appendix I), and 
the routine for constructing the (x',y') 
coordinate system of III.2.2 then covers the data
plane with parallel lines of slope m. It is 
easy to check, for any m, whether the candidate is 
cut by a line in more than two points, using 
signals from the control border. If the candidate 
is cut in exactly two points it is easy to find 
them and the midpoint of the chord they cut off 
by earlier routines. All the midpoints of all 
chords of slope m can be found simultaneously 
and the straight line recognition algorithm of 
Appendix I applied to them. If it fails the 
candidate is not a conic. If it succeeds the 
locus is the conjugate diameter. 
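
The property the subroutine relies on, that midpoints of parallel chords of a conic are collinear, can be checked in exact arithmetic. The sketch below is an idealised illustration for an axis-aligned ellipse x²/a2 + y²/b2 = 1 only; it ignores cell resolution entirely, and the function names are hypothetical rather than part of the BA routines.

```python
from fractions import Fraction

def chord_midpoint(a2, b2, m, c):
    """Midpoint of the chord cut from x^2/a2 + y^2/b2 = 1 by y = m*x + c.

    Substituting the line gives
    (b2 + a2*m^2) x^2 + 2*a2*m*c x + a2*(c^2 - b2) = 0,
    and the midpoint abscissa is the mean of the two roots. Returns None
    when the line does not cut the ellipse in two distinct points.
    """
    a2, b2, m, c = map(Fraction, (a2, b2, m, c))
    A = b2 + a2 * m * m
    B = 2 * a2 * m * c
    C = a2 * (c * c - b2)
    if B * B - 4 * A * C <= 0:
        return None
    x = -B / (2 * A)
    return x, m * x + c

def collinear(points):
    """True if all points lie on the line through the first two."""
    (x0, y0), (x1, y1) = points[0], points[1]
    return all((x1 - x0) * (y - y0) == (y1 - y0) * (x - x0)
               for x, y in points[2:])
```

For chords of slope m the midpoints fall on the line y = -(b2 / (a2·m))·x, which is the diameter conjugate to that slope.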


2. CONIC RECOGNITION 


We must omit detailed discussion of the
general algorithm and the many special cases
that arise because of lack of space, but this
should not impede its comprehension. The
procedure is essentially a translation of the
geometric terminology of II.2 into BA operations.
All talk there of chords, diameters, etc.,
usually involves little more than the subroutine
of 1 above, sometimes more than once, and the
algorithms of Appendix I. The other considera-
tions involved often need distinctions to be
made of a topological nature, e.g., whether a
curve is closed. This is easily done as in [1],
and distinguishes ellipses or circles among
central conics. Parabolas are distinguished
from central conics not only by having this
"center" at infinity, and thus never in the data
plane, but also in having the "diameter" conju-
gate to a set of finite parallel chords always
intersecting the control border. All the
distinctions possible, to the resolution of the
BA, can be carried out in analogous fashion,
using the appropriate criteria.


V. ERROR CONSIDERATIONS AND CONCLUSIONS 


Error analysis for BA algorithms generally,
and for those of this paper specifically, is of wider
interest than the cases examined, for the general
problem faced is no less than that of discrete
representation of continuous quantities. Also,
errors here go beyond those in numerical analysis, 
for they apply to patterns (perhaps functionals) 
in multidimensional spaces rather than to mere 
numbers. Many considerations like these are 
implicit or explicit in the topological dis- 
cussion of [1], and they also apply to the cases 
at hand. 


Conic identification errors arise from 
limited BA resolving power in many variations. 
For example, inability to distinguish between an 
end piece of a long narrow ellipse, a narrow 
parabola or one sheet of a hyperbola with a 
small angle between its asymptotes can be purely 
a question of resolution. Interestingly enough, 
the "kinks" of oblique straight lines never are 
a source of error in themselves, i.e., their 
resolution capabilities are precisely those of 
horizontal or vertical lines of cells. 


The BA resolving power can always be 
referred back to the number of cells in it. For 
if we double the size of a "square" BA we can 
always label the new one with half integer 
coordinates to give twice the linear resolution 
of the smaller one. 


When we change coordinate systems, e.g., to 
the confocal parabolic case of II, note that 
near the singularity many parabolic "squares" 
fit into one cell and become indistinguishable. 
Far from the focus, the parabolic cells become 
very large, containing many cells. The Jacobian 
of the transformation between systems always
measures the local ratio of cells of the two 
kinds. Error theory with a numerical analysis 
flavor comes up when these considerations are 
pursued. The adequacy of chordal and tangent 
approximations to curves (and specifically the 
parabola) becomes better and better the farther 
one gets from the focus (for fixed cell size). 
It should be possible to develop an approximation
theory using straight lines (or planes in the
3D case) which will permit immediacy on a BA for
many processes (e.g., curve fitting, integration,
etc.). We believe the field of BA error analysis
has much promise.


APPENDIX I


We show here how to determine a straight 
line from two points. SEE ERRATUM. 


In Figure A1.1(a), all tagged cells are
shaded, the two dotted ones being the initially
tagged pair. Take the coordinate lines as running
through the centers of the cells; four such lines
are illustrated going through the two dots,
forming a rectangle. The slope of the line is the
height of the rectangle divided by its base, here
2/4 or 1/2, in general p/q. We can take p < q
with no loss in generality (interchange x and y
axes if p > q). In effect we have p and q given
in stroke notation, and the slope is immediately 
calculable by the "number theorist" BA of [1]. 
Essentially the same calculation gives the 
straight line code [6], from which "tagging" 
information can be immediately derived and bussed 
to the appropriate cells. The following is 
essentially the treatment to appear in John 
Mellby's dissertation. 


The code assigns 0 to a cell which is part of
a horizontal run (Figure A1.1(b)), 1 to a pair
of cells corresponding to a rise (Figure A1.1(c)).
The two 1-cells were labelled A and B in step 2
of the conformal parabola algorithm. Figure A1.2
indicates how the code is determined from p and
q. Marking p and q along an axis as shown is
easily done from Figure A1.1 using 45° busses to
mark q cells along the column of cells containing
the right dot.


FIGURE A1.1 - Tagged Cells of a Straight Line


FIGURE A1.2 - Code Determined From p and q


We wish to calculate the code of a line 
whose slope is p/q. Basically we have to find 
the cells in which the line will pass vertically 
to the next cell. 


The following observations are elementary. 
When a line with slope p/q has gone one unit
horizontally, it will have gone p/q of a unit
vertically. When it has gone two units hori-
zontally, it will be 2p/q units vertically, and
so on. Also when the line has gone one unit
vertically, it will have gone q/p units hori-
zontally.


Now note that for each k where kp/q is
greater than a unit, and (k-1)p/q is less than
that unit, this means that the line has passed a
vertical unit. In BA terms, it has crossed
vertically into another cell. Also, it has done
this in the k-th (horizontal) cell. Then the "1"
digits of the code are in the $k_1, k_2, \dots$ cells
where $k_i$ is defined by

$$k_i \cdot p/q > i > (k_i - 1) \cdot p/q.$$


Rothstein and Weiman performed this opera-
tion using two shift registers of length p and q.
The registers shifted simultaneously; as the p
register completed each cycle, it indicated a
horizontal move of one cell. (Note that after
the first p cycle, the q register had completed
p/q of its cycle. This indicates that the line
has gone p/q of a unit vertically.) Then when-
ever the q cycle ends, this indicates a vertical move
of one cell.


We execute this process immediately using
an interval marking algorithm. Note that if a
line is marked at an interval of p cells, those
cells could indicate the cycle of a p-length
shift register. Then to show two registers, we
mark the line at intervals of p and q (Figure
A1.2). The end of each p cycle, or the p marking,
indicates a horizontal shift, which we will call
for now a '0', and the q marking is a vertical
shift or a '1'. Wherever we have both p and q
markings there is a vertical and horizontal shift, and
this is the end of one cycle of the code.


To convert this to the line code, note that
a '1' in the line code denotes a vertical and
horizontal shift, while our '1' is only a vertical
shift. Thus we must combine each of our '1's with
a '0' after it, so we create a special compres-
sion routine which will delete a 0 for each 1
(Figure A1.3(a)). When we compress out the
"blanks" (Figure A1.3(b)), it becomes the straight
line code for a line of slope p/q. Notice that
we can obtain any number of cycles of the code,
i.e., generate any length of line we wish.
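
The two descriptions of the code, by vertical crossings and by interval marking with compression, can be compared in a short sequential sketch. Both functions are illustrative and their names are hypothetical. One assumption is made explicit: a row boundary reached exactly at the end of horizontal step k is counted in step k (that is, k·p/q ≥ i > (k-1)·p/q), which is needed for the last digit of each cycle.

```python
def line_code(p, q):
    """Code digits for one cycle of a line of slope p/q (0 < p <= q).

    Digit k (k = 1..q) is '1' when the line enters a new row during
    the k-th horizontal step, otherwise '0'.
    """
    return "".join(
        "1" if (k * p) // q > ((k - 1) * p) // q else "0"
        for k in range(1, q + 1)
    )

def line_code_by_marking(p, q):
    """Same code via interval marking followed by compression."""
    marks = [(j * p, 1, "0") for j in range(1, q + 1)]    # p markings
    marks += [(i * q, 0, "1") for i in range(1, p + 1)]   # q markings
    marks.sort()              # at a shared position the '1' comes first
    digits, pending = [], 0
    for _, _, symbol in marks:
        if symbol == "1":
            digits.append("1")
            pending += 1      # the next '0' is absorbed into this '1'
        elif pending:
            pending -= 1      # compressed out
        else:
            digits.append("0")
    return "".join(digits)
```

For slope 2/5, for instance, the marks fall at 2, 4, 5, 6, 8, 10 and 10, and deleting one '0' after each '1' leaves five digits with two rises, in agreement with the crossing rule.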


(a) (b) 


FIGURE A1.3 - Code Compression


Once we have the line code, we can use it to 
create the line itself on the BA. 


Each digit of the code determines the type
of transition of the line through that row. How-
ever, each separate digit does not in and of it-
self determine which cells that line runs through.
We can create a configuration of busses which
will generate the line. In each row all cells
create busses connecting them with cells in the
adjacent rows, according to that row's code digit.
(If the digit is 0 then no rise is indicated so
cells in the same row are connected, Figure
A1.1(a). If the digit is 1 then cells are
connected so that the bus "rises" one cell,
Figure A1.1(b).)


With this bus system, the origin cell of the 
line need only send a signal on these busses to 
signal the entire line. 


The complementary problem to line deter- 
mination is recognizing whether a pattern of tag- 
ged cells actually is a straight line. If we 
have the pattern representing a line we can deter- 
mine whether this represents a complete segment 
of the line's code. 


There are two parts to this process. First 
we can find the endpoints of the pattern and using 
the line determination routine, find the code of 
the line which passes through those points. Then 
we can determine the code of the pattern itself 
and compare the two. If the codes match, the 
pattern is a line. 


We already know how to determine the code
from endpoints. To determine the code of a
pattern, each cell in the pattern "examines" its
neighbors. Each cell must be a run cell, or one
of the two bend cells (Figure A1.4). If the
cell has only neighbors to the right and left,
it is a run cell, and indicates a '0' in the
code. If a cell has a neighbor above it, it
is a bend cell and indicates a '1' in the
code. If a cell has a neighbor below it, it is
a bend cell, but the cell below it will indicate
the code.


FIGURE A1.4 - Run and Bend Cells (x)


When each cell has determined its own state
it sends a signal to the "goal" line showing its
state. Then the goal line contains a copy of
the code. This process is illustrated in
Figure A1.5.


FIGURE A1.5 - Code to Goal Line


APPENDIX II (By John Mellby) 


We wish to mark a row of cells at points $a_n$
defined by

$$a_n = n^2.$$

They divide the row into segments of length
$\ell_n$ given by

$$\ell_n = n^2 - (n-1)^2 = 2n - 1.$$


To accomplish this, we create bus configura- 
tions to generate the separate segments, and then 
concatenate them. 


Observe that whenever we send a signal on 
a diagonal bus, for each cell the signal travels 
along the diagonal it goes one cell horizontally 
and vertically. Thus, if the signal travels k 
cells along the diagonal and we project this 
upon the marking line we have gone k cells on
the marking line (Figure A2.1).


FIGURE A2.1 - Diagonal Busing


The bus configuration to do this will consist
of diagonal and vertical busses. The general cell
consists of bus connections to travel diagonally
and vertically (Figure A2.2(a)). The cells in the
$\ell_k$th row receive the signal from the diagonal
and transmit it on the vertical (Figure A2.2(b)).
Finally, the marking row receives the signal from
the vertical bus (Figure A2.2(c)). Then the
entire pattern looks like Figure A2.2(d).


FIGURE A2.2 - Individual Plane Configurations


Each of these segment configurations can be 
done in a separate plane, then the planes 
connected (3D BA). 


To generate the segments in the correct
position on the row, assume the $(k-1)$th segment is in
its proper place (i.e., cells $a_{k-2}+1$ through $a_{k-1}$).
Then the $k$th segment will start at $a_{k-1}+1$ and will
be $\ell_k$ cells long, ending at $a_k$. Since $a_1$ starts
by definition in the proper location we have, "by
induction", the segments with the proper length
at the proper points along the line.
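
The induction can be traced with a short sequential sketch. It assumes, as the segment lengths 1, 3, 5, ... imply, that the marks fall at the perfect squares; the function name is hypothetical, and the loop merely mimics what the concatenated planes do in a single immediate step.

```python
def mark_squares(row_length):
    """Return the marked cell indices a_1, a_2, ... up to row_length.

    Segment k has length 2*k - 1 and starts in the cell after
    a_(k-1), so it ends at a_k = k**2.
    """
    marks, end, k = [], 0, 1
    while end + (2 * k - 1) <= row_length:
        end += 2 * k - 1          # concatenate the k-th segment
        marks.append(end)
        k += 1
    return marks
```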


If we number the plane containing the marking
row 1 and each successive plane accordingly, then
the $i$th plane will contain the configuration for
marking a segment of length $\ell_i$. The beginning
of the $i$th signal comes from the $(i-1)$st plane,
and when the signal ends it will go to the
$(i+1)$st plane. Let the symbol (in the $i$th
plane) © indicate a bus coming from the $(i-1)$st
plane and @ indicate a bus going to the $(i+1)$st
plane. Then a cell in the marking line looks like
Figure A2.3(a).


[Figure A2.3 legend: bus from previous plane; bus to next plane.]


FIGURE A2.3 - Interplane Busses


Now a signal is sent on the first diagonal
bus in plane 1. It will mark the $a_i$ cells in the
individual planes, then we project these marks
onto the marking row in plane 1 and the row is
marked as desired.


To create the configurations in each plane,
we need only set, in each plane k, the number $\ell_k$.
Then the configuration of busses in each plane
can be constructed based on the number $\ell_k$.


To send the numbers ($\ell_k$) to the planes, the
$k$th plane receives a signal in the $\ell_k$th vertical
cell, and sends the signal on in the $\ell_{k+1}$th cell.
The pattern for this is a steep diagonal (slope
2) in a plane perpendicular to the $k$th plane.
See Figure A2.4(a). The individual cell is
as in Figure A2.4(b).


(a) 


FIGURE A2.4 - Steep Diagonal 


In summary, this routine constructs the
busses of Figure A2.4, signals $\ell_k$ in each plane,
constructs the individual plane configurations
(Figure A2.2), sends signals through these
busses and projects the markings onto the marking
row. Each of these operations is immediate so
that the whole routine is immediate.


Note the very close similarity in structure 
and operation of this BA to the "number 
theorist" BA described in [1] p. 102, Figure 3.3. 


ERRATUM 


The first paragraph of Appendix I should 
read as follows: 


We show here how to "tag" the cells along
any straight line given two tagged cells on it,
i.e., how to determine a straight line from two
points on it. Later we also show how to recog-
nize whether tagged cells are a straight config-
uration.


AN ARCHITECTURE FOR PARALLEL PROCESSING 
OF "SPARSE" DATA STREAMS 


Tom Trilling 
Technology Service Corporation 
Santa Monica, California 90403 


Abstract -- This paper describes a type of 
special purpose architecture that has been de- 
signed for a class of problems involving parallel 
data streams that contain significant information
only at occasional random unpredictable intervals.
(Such data streams are termed "sparse," analogous
to a sparse matrix.) In these problems, it is im- 
portant for the processor to be able to sense and 
analyze these occasional intervals that can be 
called "bursts of activity." The important fea- 
ture of the architecture is that a relatively 
small number of processors are shared among a 
large number of data streams. This sharing re- 
sults in a dramatic saving -- sometimes orders of 
magnitude -- in the number of processors required 
as compared with conventional architectures in 
which a processor is dedicated to every data 
stream. 


To illustrate this architecture a configura- 
tion designed for a problem in the area of radar 
target detection is discussed in some detail. A 
computer simulation of the performance of this 
architecture has been made, and, as will be shown, 
the results match the predicted savings. (This 
paper is abstracted and adapted from my Ph.D. 
dissertation [1].) 


"Sparse" Data Streams


Before describing the processor-sharing ar-
chitecture (this term will be used for brevity),
it is useful to discuss "sparse" data streams and
the type of problems for which the architecture 
was developed. The term "sparse" need not be de- 
fined, but it seems reasonable to apply it to da- 
ta streams containing significant data less than 
say five percent of the time. Examples from two 
quite different areas, medical care and radar 
target detection, will illustrate this class of 
problems and the need for this architecture. 


Medical Data Example 


Many examples exist in the field of medical 
data, but one that seems especially interesting is 
the output of electrocardiograms, ECGs, which are 
used to monitor heart function. An ECG measures 
electric potentials between various parts of the 


heart. It is one of the most effective methods of
detecting dangerous heart conditions. A normal
ECG trace is shown in Figure 1(a) [2]. Figure
1(b) gives a detailed view of a single beat pat-
tern, which contains three segments, known as
the P wave, the QRS complex, and the T wave [2].
Analysis of deviations from this normal pattern is
used in the diagnosis of various heart disorders.


Figure 1(a). The Normal ECG


Figure 1(b). Detailed View of a Single Beat


One important type of abnormal pattern is
called a PVC, or premature ventricular contraction,
as illustrated in Figure 1(c) [2]. PVCs are of
interest because they are possible indicators of a
dangerous condition, but can also occur in people
with perfectly sound normal hearts. The critical
factors are the frequency, timing, and shape of the
PVCs. Automatic processing is capable of monitor-
ing ECGs and issuing an alarm when dangerous pat-
terns occur.


Figure 1(c). Trace with Frequent Premature
Ventricular Contractions


Patients in the hospital for reasons other 
than heart problems do not normally have ECGs mon- 
itored, and occasionally serious heart problems 
develop which are not detected in time, resulting 
in damage to the heart or even death. Unfortun-
ately, it would be prohibitively expensive to
monitor every patient continuously with an ECG,
either by the conventional paper recorder, or by
dedicating a sophisticated processor to him. How-
ever, a data processing system taking advantage of
the "sparse" nature of the data stream could moni- 
tor many patients with relatively few processors. 


Radar Target Detection Example 


The following example from radar target de- 
tection is similar to problems in a number of re- 
lated areas such as communication networks, radio- 
astronomy and sonar systems. 


A radar system locates targets in space by 
emitting energy and sensing the reflections of 
that energy from the targets. The energy re- 
flected by the targets and returned to the radar 
is usually at a very low power level, often of the 
same order of magnitude as the thermal noise that 
is always present. This means that it is often 
difficult to distinguish between targets and 
noise. The method generally used is to correlate 
the results of a number of "looks" at the target, 
using statistical decision methods to achieve the 
best possible discrimination between targets and 
noise. 


A two-dimensional scanning radar is illus- 
trated in Figure 2. In this figure, the angular 
sections, numbered 1, 2, .. ., N, represent the 
central portions of successive beams transmitted 
in sequence as the antenna rotates in a clockwise 
direction. The radial dimension represents range. 
The range to the target is found by measuring the 
time interval T,, between the pulse transmission 
and the arrival of the return. This time inter- 
val, multiplied by the speed of the electromagnet- 
ic propagation, c, gives the round-trip distance 
from the radar to the target. 


Figure 2. Radar Scan Pattern with Range-Azimuth
Data Streams


In the sensing of return energy, the radar
receiver acts as an "integrate and dump" circuit 
over a time equal to the transmitted pulse width, 
t. This process imposes a granularity on the 
range measurement of ct/2, which has the effect of 
dividing the surveillance range into what are 
called range bins. In Figure 2, the range bins
are numbered 1, ..., j, ..., D, where D rep-
resents the maximum range of the surveillance vol-
ume. The number of observations or "looks," N, is
dependent on the azimuth width of the transmitted 
beam. The azimuth data inputs in each of the D 
range bins are excellent examples of "sparse" 
parallel data streams because usually only a few 
if any targets are among the many range bins in 
space. 
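
The relation between delay, range and range bin can be made concrete with a small sketch. It takes the target range as half the round-trip distance, c·T/2, and the bin width as c·t/2 as stated above, so the bin index reduces to the integer part of T/t. The function name and the choice to number bins from 1 (as in Figure 2) are assumptions of the sketch.

```python
C = 299_792_458.0  # speed of light in m/s

def range_bin(delay_s, pulse_width_s):
    """Return (range_m, bin_index) for a round-trip delay in seconds.

    range_m is c*T/2; each bin spans c*t/2 of range, so the bin index
    is floor(T / t) + 1 with bins numbered from 1.
    """
    target_range = C * delay_s / 2.0
    bin_index = int(delay_s // pulse_width_s) + 1
    return target_range, bin_index
```

With a pulse width of 1 μsec each bin is roughly 150 m deep, so a return delayed by 100.5 μsec falls in bin 101 at a range of about 15 km.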


There are many statistical decision methods 
used. A common method is to set an amplitude 
threshold and to call an input signal exceeding 
the threshold a "hit" and to call an input below 
the threshold a "miss." A detection decision is 
then based on getting a specified density of hits. 
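
A minimal sketch of such a threshold rule follows. The text does not say how "a specified density of hits" is measured, so the sketch assumes one common reading: a detection is declared when at least `min_hits` of the most recent `window` looks are hits. The function and parameter names are hypothetical.

```python
from collections import deque

def detect(samples, threshold, window, min_hits):
    """Return the index of the look at which detection is declared.

    A sample above `threshold` is a hit, otherwise a miss. Returns
    None if the required density of hits is never reached.
    """
    recent = deque(maxlen=window)     # hit/miss flags of the last looks
    for i, s in enumerate(samples):
        recent.append(s > threshold)
        if sum(recent) >= min_hits:
            return i
    return None
```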


The factor common to these examples is that 
the data streams are "sparse" but that the unpre- 
dictable bursts of activity are of great impor- 
tance and must be sensed and carefully analyzed. 
However, it would be inefficient to dedicate a 
processor to each data stream just for those in- 
tervals. It would be much more cost effective to 
share processors among the data streams. This is 
the motivation for the architecture to be des- 
cribed in the next section. 


Processor Sharing Architecture 


The processor sharing architecture is shown
in a generalized form in Figure 3. This basic
design would be modified and adapted for specific
applications, as in the specific example that
follows.


Basic Design 


The major feature of this architecture is 
that there is a bank of processors that can be 
assigned to data streams as desired. Only the 
relatively few data streams having a burst of ac- 
tivity are assigned to a processor. The great 
majority of data streams are not assigned to pro- 
cessors, and their inputs are not used until 
there is an indication of a burst of activity. 
The functional areas that accomplish these objec- 
tives are: the input bus, the input sequencing 
control unit, the ψ evaluator, and the evaluation
data routing logic. To understand the operation 
of this architecture, it is necessary to know 
both the function of each unit, and the major 
pathways of information flow. 

All data streams feed the input bus. Gener- 
ally the current input from one of the data 
streams is selected from the bus according to a 
signal from the input sequencing control unit. 
However, in some cases, a multiport bus might be 
used. The input sequencing and selection, and
the resultant sampling rate, are very much dependent
on the problem.


Figure 3. General Architecture for Processing
of "Sparse" Data Streams


The ψ evaluator makes a preliminary evalua-
tion of each input from each data stream. Data
that appears significant, such as a PVC or an input
exceeding the threshold, would cause the data
stream to be assigned to a processor. As a sim-
plification, it will be assumed that the result of
the ψ evaluation is one of two values, either +K,
data appears significant; or -1, data not signifi-
cant.


The evaluation routing logic uses the results
of the ψ evaluator and the status of the data
stream to decide what to do with the input data.
There are three possible courses of action: 1)
If the input is from a data stream previously des-
ignated as active, then a processor would already
have been assigned to that data stream. In this
case, the ψ evaluation result, whether +K or -1,
is sent to the assigned processor. 2) If the input
is from a data stream not previously designated as
active, but the ψ evaluation is +K, then the data
stream becomes designated as active. The top pro-
cessor from the stack of available processors will
be assigned to this data stream, and the evalua-
tion result will be stored by the newly assigned
processor. As will be described, this assigned
processor now becomes part of the assigned section
of the processor bank. In the rare case that
there is no processor available for the new active
data stream, an overflow signal is generated, ac-
tivating priority logic. This logic determines
which of the data is of least value and should be
dropped. (The A parameter, described shortly, may
be useful for this decision.) 3) If the input
is from a data stream not previously designated as
active, which is most often the case for "sparse"
data streams, and the ψ evaluation was -1, then the
input is simply dropped. (Thus the ψ evaluator
screens most of the data inputs, and keeps them
from ever being a load on the processor.)
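
The three courses of action can be summarised in a toy software model. Everything in the sketch is hypothetical: the class and method names, the value K = 3, the use of a threshold for the preliminary evaluation, and in particular the dictionary used to look up the assigned processor, which stands in for the hardware organization discussed later. The priority logic for overflow is not modelled.

```python
class Router:
    """Toy model of the evaluation routing logic."""

    def __init__(self, n_processors, k=3):
        self.k = k                                  # score for a significant input
        self.available = list(range(n_processors))  # stack; top is the last item
        self.assigned = {}                          # stream id -> processor id
        self.inbox = {p: [] for p in range(n_processors)}

    def psi(self, value, threshold):
        """Preliminary evaluation: +K if significant, otherwise -1."""
        return self.k if value > threshold else -1

    def route(self, stream, value, threshold):
        result = self.psi(value, threshold)
        if stream in self.assigned:                 # case 1: already active
            self.inbox[self.assigned[stream]].append(result)
            return "forwarded"
        if result > 0:                              # case 2: newly active
            if not self.available:
                return "overflow"                   # priority logic omitted
            proc = self.available.pop()             # take the top of the stack
            self.assigned[stream] = proc
            self.inbox[proc].append(result)
            return "assigned"
        return "dropped"                            # case 3: inactive and -1

    def release(self, stream):
        """Return a processor to the stack after an alert or a drop."""
        proc = self.assigned.pop(stream)
        self.inbox[proc].clear()
        self.available.append(proc)
```

With a single processor, a second stream that becomes active before the first is released produces the overflow condition described in case 2.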


The architecture does not require any partic- 
ular type of processor. However, it is efficient 
to have the processor compute a significance pa- 
rameter, A. This parameter is used in the data 
processing decision, and also serves to rank data 
streams for priority in overflow situations. One 
such processor, the sequential observer, will be 
described in the example that follows. 


A processor will consider a new input in con- 
junction with the results of the evaluations of 
previous inputs from the data stream. The proces-
sor has three possible options. First, it can
make a decision that some status of interest ex- 
ists, such as a dangerous heart condition, or a 
radar target, and issue an alert. Second, it can 
decide that the data stream is no longer active 
and drop the data stream. (In this case, the pro- 
cessor becomes unassigned, or available, and is 
switched to the available processor section as 
will be described.) Third, if the processor does 
not have enough information to make a decision 
with adequate confidence, then it will wait for 
more data, remaining assigned to the data stream. 


As indicated, the bank of processors is di- 
vided into two sections. The section indicated on 
the left contains processors that have been as- 
signed to active data streams. Once assigned to a 
data stream, a processor will normally remain ded- 
icated to that stream until it either recognizes 
some status of interest, and issues an alert, or 
else makes a negative decision, and drops the data 
stream. When one of these decisions is made, then 
the processor is transferred to the available pro- 
cessor section. This section, shown on the right 
contains unassigned, or available, processors.
These processors are organized in a pushdown stack
so that they are available as needed [3]. That is,
the "top" processor is taken first, and its remov- 
al exposes a new "top." 


One of the major design problems of the pro- 
cessor sharing architecture is the method of re- 
cognizing whether or not a data stream has been 
assigned to a processor. If the answer is "yes," 
then it is necessary to be able to locate the as- 
signed processor as promptly as possible, in order 
to properly route the new data. 


The assigned processors will be organized to 
facilitate both the recognition of assigned data 
streams, and the location of the assigned proces-
sors. In a few problems, the inputs from the da-
ta streams might occur at unpredictable intervals
and in random sequences. For this case, a content
addressable method could be of great benefit [4].
Another possibility would be to order the proces-
sors by some parameter so that a binary search
could be used. However, rearranging of the pro-
cessors would be necessary to maintain this order
whenever data streams changed from inactive to
active and vice versa. In some cases it might be simplest to
just put all the active processors in a block of 
consecutive locations. This would minimize the 
search area. 


For most applications, all the data streams 
are sampled at the same rate, in a repetitive se- 
quence. For such applications, a double linked 
circular list organization of the processors would 
be the logical choice, because it permits immedi- 
ate recognition of assigned data streams and gives 
the location of the assigned processor without any 
searching [3]. The method by which this is accom- 
plished is better explained with the illustrative 
example given in the next section. 


Specialized Configuration for Radar Target Detection


Figure 4 gives an example of a specific con- 
figuration of the processor sharing architecture, 
one designed for radar target detection. This
configuration follows the basic design of Figure 
3, but the general functional areas have been 
adapted to the problem. In this problem the data 
streams are the consecutive inputs corresponding 
to the same range bin. However, the receiver data 
comes in on a single input line. This single in- 
put line is thus a very rapid repetitive sequence 
of inputs from each of the range-azimuth data 
streams. Each input sample has a duration equal 
to the effective pulse width of the radar, which 
is typically of the order of 1 usec. The yp eval- 
uation is simply a threshold crossing detector. 
An input is significant if it exceeds the threshold 
(i.e., a hit); otherwise, it is not. 


A major problem in the design of this config- 
uration is the matching of data stream inputs with 
their assigned processors. That is, when a hit is 
received, it is necessary to quickly determine 
whether this input is from a currently active data 
stream already assigned to a processor, or if it 
is the first hit from a previously inactive data 
stream. The data rate of the single input stream 
that combines all the data streams is too high for 
a memory search method to be used. However, there 
is a method of organizing the processor memory 
that makes use of the inherent ordering of the da- 
ta streams to match data streams with their as- 
signed processors without a memory search. 


Figure 4. Configuration for Radar Data Streams 

Since the data streams represent consecutive 
range bins, the inputs are always in order of 
increasing range. Thus, the range can be used as an 
identification number for the data streams, and 
also indicates the input sequencing order. The 
processor memory is organized to take advantage of 
this ordering by having all processors stored in 
order of increasing range of the assigned data 
streams. A double linked circular list structure 
is very effective for this purpose. In this 
structure, all elements contain pointers to the 
preceding and following elements. Thus, every 
element contains sufficient information to make an 
addition or deletion in the list, without having 
to search the string. This list modification cap- 
ability is essential in this configuration, be- 
cause the processors must be kept in range order. 


With the processors in range order, the prob- 
lem of association of new inputs with assigned 
streams becomes quite simple. The method used is 
to designate the "oldest" processor as the one 
longest without an input. The range of this pro- 
cessor is stored in a register as indicated. 

There is also a pointer that always indicates the 
location of the "oldest" processor. On each new 
input, the range of that input is compared with 
the range of the oldest processor. If the range 
of the current input is less than that of the 
"oldest" processor, then the data stream corresponding 
to the current input has not been assigned to the 
oldest processor. Furthermore, the data stream 
has not been assigned to any processor: if the 
data stream had been assigned to a processor, then 
that processor would have been ahead of the present 
"oldest" processor and would be the oldest 
processor now. This association method is a key 
element in this configuration, simplifying the 
design and speeding up the processing. Thus, if 
the current input is a miss, it is discarded because 
the data stream is not active. If the current 
input is a hit, the data stream will be assigned 
to a processor which will be inserted into 
the list ahead of the "oldest" processor, since 
its range is less. Of course, when the current 
range is identical to that of the "oldest" processor, 
the new data, whether hit or miss, is routed 
to this "oldest" processor. At this point the 
next processor in the list becomes the "oldest" 
processor, and this sequence repeats. This description 
has been somewhat brief; a detailed explanation 
is given in Reference [1]. 
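
The association method described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the hardware design of Reference [1]: the names `Processor`, `Association` and `feed` are hypothetical, and the treatment of inputs whose range is greater than that of the "oldest" processor (which can occur once the pointer has wrapped around to the start of the circular list) is an assumption, handled here the same way as the "less than" case.

```python
class Processor:
    """One assigned processor: a node of a circular doubly linked list."""

    def __init__(self, rng):
        self.range = rng        # range bin of the assigned data stream
        self.inputs = []        # hits and misses routed to this processor
        self.prev = self
        self.next = self


class Association:
    """Matches inputs to processors without searching the list."""

    def __init__(self):
        self.oldest = None      # processor that has gone longest without an input

    def feed(self, rng, hit):
        """Handle one input sample and report what was done with it."""
        old = self.oldest
        if old is not None and rng == old.range:
            old.inputs.append(hit)      # hit or miss goes to the owning processor
            self.oldest = old.next      # next processor in range order is now oldest
            return "routed"
        if not hit:
            return "discarded"          # miss from an inactive data stream
        new = Processor(rng)            # first hit from a previously inactive stream
        new.inputs.append(hit)
        if old is None:
            self.oldest = new           # list was empty
        else:                           # splice in just ahead of the oldest processor
            new.prev, new.next = old.prev, old
            old.prev.next = new
            old.prev = new
        return "assigned"

    def ranges(self):
        """Ranges in list order, starting from the oldest processor."""
        out, node = [], self.oldest
        while node is not None:
            out.append(node.range)
            node = node.next
            if node is self.oldest:
                break
        return out
```

Each call to `feed` performs one comparison and at most one constant-time list insertion, which is the property the text relies on to avoid a memory search. Releasing a processor (unlinking its node and returning it to the available stack) is omitted from the sketch.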


The processors use a digitized form of a statistical 
decision method called the sequential observer [5]. 
In essence, this device is an accumulator 
and a set of decision logic. The accumulator 
is set to an initial value of zero. On a hit, 
the accumulator is incremented by an amount K. On 
a miss, the accumulator is decremented by one. If 
the accumulator reaches some threshold, T, then a 
detection decision is made. Should a processor 
count down to zero, the data stream is considered 
inactive for the time being and the processor is 
returned to the available stack. A state diagram 
for a sequential observer with K = 3 and T = 7 is 
shown in Figure 5. In this diagram, p is the 
probability that an input is significant (such as 
a hit) and q is the probability of a miss. If the 
threshold state, state 7, is reached, then a decision 
is made that the status of interest, such as 
a target, exists. This type of diagram is useful 
both for analysis of the probability of detecting a 
condition and for analysis of the probability of a 
false decision due to random input errors such as 
"noise." Not only is the sequential observer a 
convenient detection device, but the accumulator 
value gives an indication of the activity of a 
data stream (i.e., a X rating) and is thus useful 
in the processor sharing architecture in assigning 
priority for the limited number of processors. 
That is, if the bottom of the stack of available 
processors is ever reached, then there is an overflow 
of active data streams and some data must be 
sacrificed; the data stream with the lowest 
count would be dropped. 
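
The accumulator rule of the sequential observer can be illustrated with a short Python sketch. The function name `sequential_observer` is hypothetical, and clamping the accumulator at T (so that the states run from 0 to T, as in a state diagram with a top state 7) is an assumption of this sketch rather than a detail stated in the text.

```python
def sequential_observer(inputs, k=3, t=7):
    """Run the accumulator over a sequence of hits (True) and misses (False).

    Returns ("detect", n) if the threshold is reached on input n,
    ("release", n) if the count falls to zero on input n, and
    ("active", count) if the inputs run out first.
    """
    acc = 0                                     # initial accumulator value
    for n, hit in enumerate(inputs, start=1):
        acc = min(acc + k, t) if hit else acc - 1
        if acc >= t:
            return ("detect", n)                # detection decision
        if acc <= 0:
            return ("release", n)               # processor returns to the stack
    return ("active", acc)
```

With K = 3 and T = 7, the sequence hit, miss, hit, miss, hit steps through counts 3, 2, 5, 4, 7 and so produces a detection on the fifth input, while a single hit followed by three misses releases the processor on the fourth input.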


Performance of the Processor Sharing Architecture 


The use of the processor sharing architecture 
entails acceptance of some sacrifice in recogni- 
tion performance on the unusual occasions that 
pseudo-active data streams tie up all the proces- 
sors. When this condition occurs, active data 
streams cannot be assigned to a processor. 


However, with proper selection of the number 
of processors, according to the system parameters 
and requirements, the loss of performance can be 
reduced to minimal levels, while still achieving 
a great savings in the number of processors required. 
For this analysis, it will be assumed 
that the number of active data streams has a 
Gaussian distribution, with a mean of $\mu_B$ and a 
standard deviation of $\sigma_B$. 

Figure 5. State Diagram for Sequential Observer, with K = 3 and T = 7 

Let D be the number of data streams, and $P_B$ 
be the probability that a data stream is active. 
Then $\mu_B$ and $\sigma_B$ are given by: 

$$\mu_B = D P_B \qquad (1)$$

$$\sigma_B = \sqrt{D P_B (1 - P_B)} \qquad (2)$$

If $P_B$ is small, then Equation 2 can be approximated by: 

$$\sigma_B \approx \sqrt{D P_B} \qquad (3)$$


Clearly the number of processors must exceed 
the mean, $\mu_B$, by some safety factor. If this 
safety factor is expressed as a probability that a 
processor will be available when required, then 
the following equation is obtained: 

$$N_P = \mu_B + K_c \sigma_B \qquad (4)$$

where $N_P$ is the number of processors required and 
$K_c$ is the confidence constant expressed in units 
of the standard deviation. 

Thus 

$K_c = 1$ corresponds to a processor availability probability of 0.84, 
$K_c = 2$ corresponds to a processor availability probability of 0.977, 
and 
$K_c = 3$ corresponds to a processor availability probability of 0.9987. 


The measure of the effectiveness of the processor-sharing 
architecture is the ratio of the 
number of processors required to the number of data 
streams. This parameter, $R_P$, is given by: 

$$R_P = \frac{N_P}{D} \qquad (5)$$

By substituting Equations 1, 3, and 4 into Equation 5, 
the following equation for $R_P$ is obtained: 

$$R_P = P_B + K_c \sqrt{\frac{P_B}{D}} \qquad (6)$$

In Equation 6, the $P_B$ term is a function only 
of the "sparseness" of the data stream, and is 
thus not under the control of the digital system. 
Thus, $P_B$ determines the lower limit on the percentage 
of processors required. The $K_c \sqrt{P_B/D}$ 
term is the safety factor, or confidence level, 
and can be varied according to the system requirements. 
It will be noted that this term gets 
smaller in proportion to $P_B$ as D gets larger. 
Thus, a very high processor availability, such as 
0.9987, can be attained with only a modest percentage 
increase in processors over the minimum 
level determined by the value of $P_B$. 
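
As a worked illustration of Equations 1, 3, 4, 5 and 6, the following Python sketch computes the processor requirement for a given number of data streams. The function and parameter names are hypothetical, and no rounding to a whole number of processors is applied, since the text does not state a rounding rule.

```python
import math


def processors_required(d, p_b, k_c):
    """Return (mean, std. deviation, processors, ratio) for D data streams."""
    mu = d * p_b                      # Equation 1
    sigma = math.sqrt(d * p_b)        # Equation 3 (small-probability approximation)
    n_p = mu + k_c * sigma            # Equation 4
    return mu, sigma, n_p, n_p / d    # last entry is Equation 5


def ratio_closed_form(d, p_b, k_c):
    """Equation 6: the same ratio written directly in terms of D and P_B."""
    return p_b + k_c * math.sqrt(p_b / d)
```

For D = 10,000, an activity probability of 0.01 and a confidence constant of 3, this gives a mean of 100, a standard deviation of 10, and 130 processors, i.e. a ratio of 1.3 percent; both ways of computing the ratio agree.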


The parameter $P_B$ is thus the critical parameter 
in the savings achievable by the processor-sharing 
architecture. This parameter is always 
"small" for "sparse" data streams, but the meaning 
of "small" depends on the type of problem. For 
radar target detection, $P_B$ would usually be less 
than 0.01, often very much less. For other applications, 
$P_B$ might be as high as 0.05. 

Some examples of the savings achievable by 
the processor-sharing architecture for $P_B$ = 0.01 
and $P_B$ = 0.05 are given in Tables 1 and 2. In 
these tables, the first column gives D, the number 
of data streams. The second and third columns 
give $\mu_B$ and $\sigma_B$ respectively. The next three columns 
give the number of processors required for 
availability probabilities of 0.84, 0.977, and 
0.9987 respectively. The next column gives the 
ratio $R_P$ (i.e., $N_P/D$) for an availability probability 
of 0.9987. Of course, $R_P$ does approach $P_B$ 
as D gets large because the data were computed from 
Equation 6. Additional support for this result is 
given in Reference [1] based on both analysis and 
computer simulation. The last column shows the 
percentage savings achieved by the processor-sharing 
architecture, for the condition of 0.9987 
availability probability, as compared with a conventional 
architecture using one processor per 
data stream. It is seen that for $P_B$ = 0.01, a 
savings of 96 to 98.7 percent is achieved. Even 
for the relatively high $P_B$ = 0.05, a savings of 
89 to 94.3 percent is still achieved. 


Computer Simulation 


A detailed computer simulation was run to 
test the theoretical predictions of savings in 
processors. The radar example was used because 
sophisticated mathematical models exist for most 
aspects of radar detection. In the simulation, 
the detection performance of the processor sharing 
architecture with a limited number of processors 
was compared with that of a system using a 
processor for every data stream. This reference 
system used M out of N detection processors, which 
are easily analyzed mathematically and are 
approximately equivalent in performance to the 
sequential observer [1]. 


A great many cases were run. The volume of 
data obtained and the complexity of the program 
are too great to permit more than a brief summary 
in this paper. More data and a detailed description 
of the computer program are given in Reference [1]. 
In the example below, as in all of the 
cases run, the results were quite consistent with 
the theoretical predictions. 


The basic approach of the computer program 
was to compute the signal strength required for 
detection by the processor sharing architecture 
under the specified conditions, and also to compute 
the signal strength required by the comparable 
conventional approach with a processor for 
every data stream. The additional signal strength, 
if any, required by the processor sharing configuration 
was used as the measure of loss incurred by 
the processor sharing. Signal strength was specified 
as a signal-to-noise ratio (S/N) and measured 
in decibels (dB). The major input parameters were: 
the number of data streams, the number of processors, 
the required probability of detection ($P_D$), 
and the acceptable level of false alarms ($P_{FA}$). 


The program first determined the appropriate 
parameters for the sequential observer. It then 
computed the stationary probabilities for each 
state under noise alone, that is, the average percentage 
of the time the machine would be in a given 
state. (An analytic method for this computation is 
derived in Reference [1].) These probabilities 
were used to compute how often there was a processor 
available when needed by a data stream becoming 
active. An iterative method was then used to 
find the S/N required to achieve the desired $P_D$ 
with the acceptable $P_{FA}$; typically 10-15 iterations 
reduced the error to an extremely small level. The 
computation of the required S/N for the reference 
M out of N detector also used an iterative method. 
Basically it involved finding the required probability 
for a single trial that corresponds to a 
desired total probability of the cumulative binomial 
distribution. That is, find p such that: 

$$\sum_{j=M}^{N} \frac{N!}{j!\,(N-j)!}\; p^{j} (1-p)^{N-j} = P_D$$


TABLE 1. PROCESSOR REQUIREMENTS FOR CASE WITH $P_B$ = 0.01 

    D           mu_B    sigma_B     N_P, Number of Processors      R_P            Percentage
    Number of   Mean    Standard    for Availability Probability   Percentage of  Saving
    Data                Deviation    0.84     0.977    0.9987      Processors for
    Streams                                                        0.9987 Case

       100         1      1.0          2        3        4           4.0           96.0
       500         5      2.2          7        9       12           2.4           97.6
     1,000        10      3.2         13       16       20           2.0           98.0
     5,000        50      7.1         57       64       71           1.4           98.6
    10,000       100     10.0        110      120      130           1.3           98.7


TABLE 2. PROCESSOR REQUIREMENTS FOR CASE WITH $P_B$ = 0.05 

    D           mu_B    sigma_B     N_P, Number of Processors      R_P            Percentage
    Number of   Mean    Standard    for Availability Probability   Percentage of  Saving
    Data                Deviation    0.84     0.977    0.9987      Processors for
    Streams                                                        0.9987 Case

       100         5      2            7        9       11          11.0           89.0
       500        25      5           30       35       40           8.0           92.0
     1,000        50      7           57       64       71           7.0           93.0
     5,000       250     16          266      282      298           6.0           94.0
    10,000       500     22          522      544      566           5.7           94.3

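
The single-trial probability search for the M out of N reference detector can be sketched as follows. The text says only that an iterative method was used; bisection is chosen here as one simple possibility and is an assumption, as are the function names. The sum is taken from j = M to N, i.e. the probability of at least M hits in N trials.

```python
from math import comb


def m_of_n_probability(p, m, n):
    """Probability of at least m hits in n independent trials, each with probability p."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(m, n + 1))


def single_trial_probability(target, m, n, iterations=60):
    """Find p such that m_of_n_probability(p, m, n) equals target, by bisection."""
    lo, hi = 0.0, 1.0
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if m_of_n_probability(mid, m, n) < target:
            lo = mid        # total probability too low: raise p
        else:
            hi = mid        # total probability high enough: lower p
    return (lo + hi) / 2
```

Bisection works here because, for M of at least 1, the total probability increases monotonically with the single-trial probability. For example, with M = N = 2 the total probability is simply the square of p, so a target of 0.25 yields p = 0.5.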
Table 3 summarizes the results for a number 
of runs with various numbers of processors and 
$P_D$'s, for conditions of 1000 data streams and 
$P_{FA} = 10^{-8}$. For these conditions $\mu_B$ is 
found to be 10 and $\sigma_B$ is 3. Thus, for a confidence 
level of two $\sigma_B$, 16 processors would be required. 
It is seen that the losses decrease with 
the number of processors, and that with 16 processors 
the loss does become minimal, agreeing 
with the prediction. 


TABLE 3. S/N LOSS (dB) FOR VARIOUS $P_D$ AND 
NUMBER OF PROCESSORS. 10 LOOKS, 
(A dash indicates that the $P_D$ cannot be attained.) 


(Note, a 0.1 dB loss corresponds to requiring a 
signal 1.023 times as strong as the reference, a 
negligible difference in almost all cases.) This 
example thus corresponds to a savings in processors 
of 98.4 percent, over the 1000 processors 
required by a conventional architecture. 


Summary of Conclusions 


An architecture has been presented for a 
class of problems involving parallel "sparse" 
data streams. This architecture achieves a dra- 
matic reduction in the number of processors re- 
quired, as compared with conventional architec- 
tures. Analysis indicated that savings of over 90 
percent of the processors would result in typical 
cases. These predictions were confirmed by a 
computer simulation for a representative set of 
problems. 


These savings are important not only for the 
direct savings in hardware cost, but also because 
they result in an increase in system reliability 
and a reduction in size, weight, and power con- 
sumption. For space and airborne systems these 
latter factors would be far more important than 
just a savings of money. It is, therefore, recommended 
that the processor-sharing architecture be 
considered for as many parallel processing appli- 
cations as possible. 
A MULTIPROCESSOR FOR CONTINUOUS SYSTEM SIMULATION 


E. Pearse O'Grady 
Department of Electrical and Computer Engineering 
Arizona State University 
Tempe, Arizona 85281 


Summary 


A simulation-oriented multiprocessor computer 
system employing a new concept for interprocessor 
communication is described. The communication 
scheme takes advantage of the iterative computing 
requirements of simulation and related applications 
and employs parallelism in conjunction with ad- 
dress-mapping memories to realize an efficient, 
high-speed interprocessor transfer mechanism. 
Multiprocessor operation consists of a sequence of 
compute and data-exchange phases. A compute phase 
is a relatively long period during which all pro- 
cessing elements compute without requiring inter- 
processor data transfers. A data-exchange phase 
is a period during which all processing elements 
are halted and interprocessor transfers are carried 
out according to a preordained plan. 


The multiprocessor includes a host processor, 
a system control processor with user console, an 
N x N array of processing elements interconnected 
through N horizontal buses and N vertical buses, 
2 x N bus control processors, and an input/output 
processor. The host processor is a multiprogrammed 
general-purpose computer which provides program 
development and utility functions such as editing, 
file management, and language translation. The 
system control processor is a dedicated general- 
purpose computer which acts as a run-time executive 
for the simulation system. Its functions are to 
control problem set-up, execution, and termination 
and provide an interface between the user and the 
system during execution of a simulation study. 


A processing element is a high-speed bit slice 
microprocessor featuring simulation-oriented and 
multiprocessor-oriented operations. The main mem- 
ory consists of a small high-speed segment, called 
the transfer memory, and a larger segment called 
the local memory. The processing element's program 
is stored in local memory along with local problem 
variables. The transfer memory holds variables 
which are involved in interprocessor data trans- 
fers. During interprocessor data transfer opera- 
tions, the processing element is halted and the 
transfer memory is connected to either the horizon- 
tal bus or the vertical bus associated with the 
processor. 


A bus control processor consists of an address 
sequencer (microprogram-control-unit chip) and a 
memory which is loaded by the system control processor. 
The bus control processor fetches bus 
commands from its memory and either executes them 
internally or issues them to all processing ele- 
ments on its bus. There are three types of bus 
commands specifying either an interprocessor data 
transfer, a multiprocessor-oriented operation, or 
an internal bus-control-processor operation. 


Processing elements in the N x N array ex- 
change data through N horizontal buses and N verti- 
cal buses. A processing element interfaces with 
the horizontal bus and the vertical bus associated 
with its row and column through its transfer memory 
and a memory switch. The memory switch functions 
as a bus-command decoder and as a three-way switch 
which connects the transfer memory to the process- 
ing element or to one of the two buses. When 
connected to a bus the transfer memory participates 
in interprocessor transfers with other processing 
elements on that bus. 


A bus control processor executes an interprocessor 
transfer by issuing a bus command which 
specifies a source processor, up to (N-1) destination 
processors, and one address. A mapping memory 
in each memory switch translates the address 
to independent addresses in the source processor 
and each destination processor, allowing efficient 
use of limited transfer memory. Transferring a 
single data item between arbitrary source and 
destination processors requires at most a horizontal-bus 
transfer from the source processor's 
transfer memory to the transfer memory at the 
intersection of the source row and the destination 
column, followed by a vertical-bus transfer from the 
transfer memory at the intersection to the destination 
processor's transfer memory. During a data-exchange 
phase, interprocessor data transfers distribute 
data values to all processors which need 
them. First the N horizontal buses operate in 
parallel to perform all horizontal transfers. Then 
the N vertical buses operate in parallel to perform 
all vertical transfers. In some problems this 
scheme can improve effective interprocessor transfer 
rates by a factor as great as N when compared 
to a broadcast scheme [1]. 
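
The two-step routing rule can be written down as a small Python sketch. The function name `route` and the representation of a processing element as a (row, column) pair are assumptions made for illustration; the mapping memories and the bus commands themselves are not modelled.

```python
def route(source, destination):
    """Return the bus hops that move one data item from source to destination.

    Processing elements are identified by (row, column) pairs in the N x N array.
    Each hop is (bus kind, bus index, from element, to element).
    """
    (sr, sc), (dr, dc) = source, destination
    hops = []
    if sc != dc:
        # horizontal bus of the source row, to the intersection element
        hops.append(("horizontal bus", sr, (sr, sc), (sr, dc)))
    if sr != dr:
        # vertical bus of the destination column, from the intersection element
        hops.append(("vertical bus", dc, (sr, dc), (dr, dc)))
    return hops
```

For example, `route((0, 1), (2, 3))` uses horizontal bus 0 to reach element (0, 3) and then vertical bus 3 to reach element (2, 3). Elements in the same row or column need only one hop, so no transfer needs more than two.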
A MODULAR MULTI-MICROPROCESSOR ORIENTED FOR REAL-TIME CONTROLS 


M. Coppo and A. Giordana 

Istituto di Scienza dell'Informazione, Università di Torino 


Summary 


Modularity and reconfigurability, besides the use 
of standard well known components, make the system 
described in this paper suitable for many applications, 
especially those of real-time control [1]. 
It was designed in particular as a development system 
and in-field prototype for automotive engine 
control. The system has been implemented at a FIAT 
Research Center (CRF). 


Fig. 1 gives a sketch of the system, which contains 
a set (from 1 to 6) of (physical and logical) processor 
modules and a common memory (CM) module, connected 
together by a common bus (similar to Intel's 
MULTIBUS) shared following a rotating priority strategy 
with hand-shake protocol. 

[Fig. 1. Sketch of the system: processor modules (Module-1 is the Supervisor), each with arbitration logic and exclusive and inclusive memory and I/O, connected through the common bus and arbitration lines to the common bus arbiter, the common I/O, and the common memory (56-64 K).] 

All processor modules are equal except one, which 
is privileged and will be called the Supervisor module. 
Each processor module is organized around an 
internal bus and is composed of the CPU (Intel 8085 
in our implementation), two memory segments defined 
as "exclusive" and "inclusive", and two I/O sets 
defined in the same way. Inclusive memories (I/Os) 
have addresses from byte 8N K to 8(N+1) K for 
module N (for I/Os, from 32N to 32(N+1)), and are 
addressable (in a fully transparent way, for both 
memory and I/O operations) by the respective CPU 
and by the Supervisor (module-1) with the same address 
set. A segment of 8 K-bytes (32 addresses for I/O) 
with addresses from 56 K to 64 K (224 to 255) is the 
common memory (common I/O) of the system. Exclusive 
memory (I/O) can have in each module any address 
not assigned to the common memory (I/O) or to inclusive 
memory (I/O) of the same module. In our implementation, 
exclusive memories (I/Os) have addresses 
from 0 to 8 K (0 to 31) for each module, and 0 to 
16 K (0 to 63) for the Supervisor. Inclusive memories 
plus common memory can be extended if the 
modules are fewer than 6, but they are limited to 
constitute, in their maximal size, a continuous set 
from 16 K to 64 K. Inclusive I/O facilities can be 
very useful for fault detection and recovery, while 
common ones can allow all CPUs to share common resources 
like bulk memory or arithmetic units. 
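
The memory map described above can be summarized by a small Python sketch. It covers memory addresses only (not I/O), takes the Supervisor to be module 1 and the ordinary modules to be 2 to 6, and uses the exclusive ranges quoted for the authors' implementation. The function name `classify` and the label "unassigned" for addresses that fall in none of the described segments are assumptions of this sketch.

```python
K = 1024
COMMON = range(56 * K, 64 * K)      # common memory, 56 K to 64 K


def classify(address, module):
    """Region that `address` falls in, as seen by `module` (1 = Supervisor)."""
    if not 0 <= address < 64 * K:
        raise ValueError("address outside the 64 K address space")
    if address in COMMON:
        return "common"
    exclusive_top = 16 * K if module == 1 else 8 * K
    if address < exclusive_top:
        return "exclusive"
    owner = address // (8 * K)      # segment 8N K .. 8(N+1) K belongs to module N
    if module == 1 or owner == module:
        return f"inclusive of module {owner}"
    return "unassigned"
```

For instance, address 25 K is inclusive memory of module 3 both for module 3 itself and for the Supervisor, which reaches it with the same address, whereas module 4 has nothing mapped there in this sketch.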
Inclusive memories of all modules can be accessed 
by the Supervisor, which appears as though it were 
a DMA device. The arbitration logic has been designed 
to avoid any possible deadlock condition. 

All modules are provided with an addressable switch 
which, acting on the bus arbitration logic, allows 
them to hold common memory in an exclusive way. 
This allows the realization of "Critical Regions". 
The Supervisor is also provided with a set of addressable 
switches which allow it to stop any other 
CPU (possibly all together) and use any inclusive 
memory segment (or I/O) as if it were the only processor 
in the system. 


The system built at CRF has 3 processor modules 
(including the Supervisor) and a common memory module 
of 8 K. A measure of this multiprocessor's performance 
is shown in fig. 2. Let the common-bus access 
rate of a program be the ratio between the cycles 
performed in external memory and the total number 
of cycles. The times plotted in fig. 2 are the execution 
times of an (arbitrary) program P, without 
synchronization requirements, executed by processor 
M1, measured in the following conditions: 
~ : external access of P in CM; M2,M3 are idle. 
Ym: as before, but M2 and M3 execute programs with 

the same access rate as P but relatively randomized 
tas as before, but external accesses of P are in 
inclusive memory of M2(M1 is the Supervisor). 
Y, is the execution time of P with no accesses via 
the common-bus. Note that, owing to efficient bus 
sharing, t,/%,is lower than 3 for 100% access rate. 
SOLVING BANDED TRIANGULAR SYSTEMS ON PIPELINED MACHINES 
Urbana, Illinois 


Abstract -- A new algorithm for solution of 
banded triangular systems on pipelined machines 
is described. The model of a pipelined machine 
consists of two functional units, a multiplica- 
tion pipe and an addition pipe, chained together. 
Each of the functional units is pipelined in s 
stages. The time bounds, speedup, efficiency, 
and utilization for such a model are developed 
and compared to those of a parallel machine. 
Furthermore, the performance of the given algo- 
rithm is compared to the common "row sweep" 
algorithm for pipelined machines. 


1. Introduction 


Banded triangular systems or linear recur- 
rence systems originate from simple DO loops like 
the following one given in FORTRAN: 


      DO 10 I = 1, 150 
         X(I) = A(I) + B(I) * X(I-1) + C(I) * X(I-3) 
   10 CONTINUE 


There are only a few published methods on how to 
solve this kind of system on either high-speed 
supercomputers or smaller array machines intended 
for the signal-processing market. Although all 
of these machines are designed to execute the 
vector inner product very efficiently, it seems 
that little attention has been paid to incorporating 
linear recurrence systems into their design. 


In this paper an algorithm for the fast 
solution of linear recurrence systems is pre- 
sented. The same algorithm in two different 
implementations can be used on either parallel or 
pipelined computers. 


Parallel linear recurrence solvers have been 
discussed previously [1-4, 6-8]. The design of 
special-purpose hardware to solve recurrence 
problems was discussed in [5]. This paper con- 
siders the converse of this problem: Given an 
ordinary machine with pipelined arithmetic units, 
what is the speedup and pipe utilization obtained 
in solving linear recurrence systems? 


In Section 2, the definitions and notations 
used throughout this paper are presented. In- 
stead of using the standard linear-algebra 
approach, a definition of linear recurrences that 
facilitates a geometric representation of the 
algorithm and simplifies the derivation of time 
bounds is used. Then the Main Theorem (Theorem 
1), which forms the basis of our algorithm, is 
proved. 


A model of a pipelined computer is described 
in Section 3. The model (Fig. 1) is made simple 
enough to cover a wide spectrum of existing 
machines (from Floating-Point Systems AP-120B to 
TI ASC, CDC STAR-100 and CRAY-1). It consists of 
two pipelined arithmetic units, a multiplier and 
an adder chained together into a multifunction 
pipe in a static configuration. Each arithmetic 
unit in the model has s stages and requires one 
time step for a single operation, although it can 
deliver s results in one time step in the vector 
mode. 


Fig. 1. A model of a pipelined computer 


To compare the performances of different machines, 
which may be parallel, pipelined or simply 
sequential, the speedup, efficiency and utilization 
must be defined using attributes that characterize 
all machines. Such an attribute is the 
computational rate $r_M$ (in operations/second) which 
can be sustained by the machine M in an ideal 
computation. On the other hand, an algorithm A is 
characterized by O(A), the number of operations 
needed to execute A. The computational time of an 
algorithm A on a machine M is denoted by $T_M(A)$. 
Note that $T_M(A) \ge O(A)/r_M$. Therefore, for any two 
machines $M_i$ and $M_j$ and any two algorithms $A_k$ and 
$A_\ell$, we can define speedup $S = T_{M_i}(A_k) / T_{M_j}(A_\ell)$, 
efficiency $E = (r_{M_i} * T_{M_i}(A_k)) / (r_{M_j} * T_{M_j}(A_\ell))$, 
utilization $U = O(A_\ell) / (r_{M_j} \cdot T_{M_j}(A_\ell))$ 
and redundancy $R = O(A_\ell) / O(A_k)$. 
This is a generalization of the notation 
found in [1]. 


Section 3 concludes the paper by comparison 
of results for pipelined and parallel machines. 
The pipelined machine with two functional units 
and s stages per unit has almost 2 times better 
performance for large bands than a parallel machine 
with s processors where each processor is 
capable of executing any arithmetic operation in 
one step. Although the hardware cost of pipelining 
is considerably smaller than the cost of 
parallelizing, it is impossible to increase the 
performance indefinitely by adding extra stages, 
since the number of gate levels in an implementation 
of a floating-point add or multiply is fixed. 
In present-day machines, s is between 2 and 8. 
Furthermore, the common "row sweep" algorithm 
tends to perform better for small s and large m 
than the algorithm presented in this paper. 


2. Main Theorem 


Definition 1 An m-th order linear recurrence 
system R<n,m> is the set of n equations 

$$x_i = a_i * X_{i-1}^T, \quad 1 \le i \le n \qquad (1)$$

where * denotes matrix multiplication and 
$a_i = (a_{i1}, a_{i2}, \ldots, a_{im}, c_i)$ is the vector of 
coefficients, $X_{i-1} = (x_{i-1}, x_{i-2}, \ldots, x_{i-m}, 1)$ is 
the vector of variables with initial vector 
$X_0 = (x_0, x_{-1}, \ldots, x_{-m+1}, 1)$ specified in 
advance. 

Throughout this paper any vector x is a row 
vector and $x^T$ is the corresponding column vector. 

Furthermore, we will need a set of m+1 unit 
vectors $e_j = (e_{j1}, e_{j2}, \ldots, e_{j,m+1})$ where for all 
j, $1 \le j \le m+1$, 

$$e_{ji} = \begin{cases} 0 & \text{if } i \ne j \\ 1 & \text{if } i = j \end{cases} \qquad (2)$$

If a row (column) of any matrix is equivalent to 
a unit vector, it is called a trivial row (column). 


Theorem 1 

Any R<n,m> system can be written in the 
form 

$$x_i = b_i * X_0^T \qquad (3)$$

where for all i, $1 \le i \le n$, $b_i = (b_{i1}, b_{i2}, \ldots, b_{im}, d_i)$ 
is defined recursively as follows: 

$$b_i = a_i * B_{i-1} \qquad (4)$$

with matrix $B_{i-1} = (b_{i-1}, b_{i-2}, \ldots, b_{i-m}, e_{m+1})^T$, 
and $b_0, b_{-1}, \ldots, b_{-m+1}$ equal to $e_1, e_2, \ldots, e_m$. 


Proof (by mathematical induction on i) 

(Basis) 

$$\begin{aligned}
x_1 &= a_1 * X_0^T && \text{by Def. 1} \\
    &= a_1 * I * X_0^T && \text{by def. of the } I \text{ matrix} \\
    &= a_1 * (e_1, e_2, \ldots, e_m, e_{m+1})^T * X_0^T \\
    &= a_1 * (b_0, b_{-1}, \ldots, b_{-m+1}, e_{m+1})^T * X_0^T && \text{by def. of } b_0, \ldots, b_{-m+1} \\
    &= b_1 * X_0^T && \text{by def. of } b_1
\end{aligned}$$

(Induction step) 

$$\begin{aligned}
x_{i+1} &= a_{i+1} * X_i^T && \text{by Def. 1} \\
        &= a_{i+1} * \left( (b_i, b_{i-1}, \ldots, b_{i-m+1}, e_{m+1})^T * X_0^T \right) && \text{by hypothesis} \\
        &= \left( a_{i+1} * (b_i, b_{i-1}, \ldots, b_{i-m+1}, e_{m+1})^T \right) * X_0^T && \text{by associativity} \\
        &= b_{i+1} * X_0^T && \text{by def. of } b_{i+1}
\end{aligned}$$

Theorem 1 shows that any set of consecutive 
variables $\{x_i \mid j \le i \le k\}$ can be computed in parallel, 
that is, independently of each other, if 
vectors of parallel coefficients $b_i$, $j \le i \le k$, 
are generated and substituted for $a_i$, $j \le i \le k$. 
The computation of all $b_i$ is the overhead that 
must be paid for parallelism. The following example 
should clarify this idea. 


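Let us consider linear recurrence system R<9,2>. Using standard notation of linear algebra R<9,2> can be written as

$$
\begin{bmatrix}
1 \\
-a_{21} & 1 \\
-a_{32} & -a_{31} & 1 \\
 & -a_{42} & -a_{41} & 1 \\
 & & -a_{52} & -a_{51} & 1 \\
 & & & -a_{62} & -a_{61} & 1 \\
 & & & & -a_{72} & -a_{71} & 1 \\
 & & & & & -a_{82} & -a_{81} & 1 \\
 & & & & & & -a_{92} & -a_{91} & 1
\end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \\ x_6 \\ x_7 \\ x_8 \\ x_9 \end{bmatrix}
=
\begin{bmatrix} c_1 \\ c_2 \\ c_3 \\ c_4 \\ c_5 \\ c_6 \\ c_7 \\ c_8 \\ c_9 \end{bmatrix}
$$

and transformed into

$$
\begin{bmatrix}
1 \\
-a_{21} & 1 \\
-a_{32} & -a_{31} & 1 \\
 & -a_{42} & -a_{41} & 1 \\
 & & -b_{52} & -b_{51} & 1 \\
 & & -b_{62} & -b_{61} & & 1 \\
 & & -b_{72} & -b_{71} & & & 1 \\
 & & -b_{82} & -b_{81} & & & & 1 \\
 & & & & & & -a_{92} & -a_{91} & 1
\end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \\ x_6 \\ x_7 \\ x_8 \\ x_9 \end{bmatrix}
=
\begin{bmatrix} c_1 \\ c_2 \\ c_3 \\ c_4 \\ d_5 \\ d_6 \\ d_7 \\ d_8 \\ c_9 \end{bmatrix}
$$

where

$$
[b_{51}, b_{52}, d_5] = [a_{51}, a_{52}, c_5] *
\begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}
= [a_{51}, a_{52}, c_5]
$$

$$
[b_{61}, b_{62}, d_6] = [a_{61}, a_{62}, c_6] *
\begin{bmatrix} b_{51} & b_{52} & d_5 \\ 1 & 0 & 0 \\ 0 & 0 & 1 \end{bmatrix}
= [a_{61} a_{51} + a_{62},\; a_{61} a_{52},\; a_{61} c_5 + c_6]
$$

$$
[b_{71}, b_{72}, d_7] = [a_{71}, a_{72}, c_7] *
\begin{bmatrix} b_{61} & b_{62} & d_6 \\ b_{51} & b_{52} & d_5 \\ 0 & 0 & 1 \end{bmatrix}
= [a_{71}(a_{61} a_{51} + a_{62}) + a_{72} a_{51},\; a_{71} a_{61} a_{52} + a_{72} a_{52},\; a_{71}(a_{61} c_5 + c_6) + a_{72} c_5 + c_7]
$$

$$
\begin{aligned}
[b_{81}, b_{82}, d_8] &= [a_{81}, a_{82}, c_8] *
\begin{bmatrix} b_{71} & b_{72} & d_7 \\ b_{61} & b_{62} & d_6 \\ 0 & 0 & 1 \end{bmatrix} \\
&= [\,a_{81}(a_{71}(a_{61} a_{51} + a_{62}) + a_{72} a_{51}) + a_{82}(a_{61} a_{51} + a_{62}), \\
&\qquad a_{81}(a_{71} a_{61} a_{52} + a_{72} a_{52}) + a_{82} a_{61} a_{52}, \\
&\qquad a_{81}(a_{71}(a_{61} c_5 + c_6) + a_{72} c_5 + c_7) + a_{82}(a_{61} c_5 + c_6) + c_8\,]
\end{aligned}
$$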
The above coefficients can be obtained by direct substitution of $x_5$ into $x_6$, $x_5$ and $x_6$ into $x_7$, and finally $x_5$, $x_6$ and $x_7$ into $x_8$. The new transformed matrix allows parallel computation of $x_5$, $x_6$, $x_7$ and $x_8$ since they do not depend on each other anymore. It seems that nothing has been accomplished since computation of $\mathbf{b}_i$, $5 \le i \le 8$, must be executed serially, and it requires more time than the original one. However, the generation of $\mathbf{b}_i$, $5 \le i \le 8$, is independent of any other subset of variables whose intersection with $\{x_i \mid 5 \le i \le 8\}$ is empty, and therefore, the generation of $\mathbf{b}_i$'s for any two disjoint subsets of variables can be performed concurrently.
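
The substitution can be checked numerically. The following is a minimal sketch, not the authors' implementation: it assumes the coefficients are given as plain Python lists (`a[i]` holding the m band coefficients of one equation, `c[i]` its constant term) and that `x_init` lists the m starting values, most recent first. The function names are hypothetical. It builds each vector of parallel coefficients as the product of the coefficient vector with the matrix of previously generated vectors, then evaluates every variable directly from the starting values.

```python
def row_sweep(a, c, x_init):
    """Evaluate the recurrence one variable at a time (serial reference)."""
    hist = list(x_init)              # [x_{i-1}, ..., x_{i-m}]
    xs = []
    for a_i, c_i in zip(a, c):
        x_i = sum(a_ij * x for a_ij, x in zip(a_i, hist)) + c_i
        xs.append(x_i)
        hist = [x_i] + hist[:-1]     # shift the window of m previous values
    return xs


def parallel_coefficients(a, c, m):
    """Generate b_i = a_i * B_{i-1}, starting from B_0 = identity."""
    size = m + 1
    B = [[1 if r == k else 0 for k in range(size)] for r in range(size)]
    bs = []
    for a_i, c_i in zip(a, c):
        vec = list(a_i) + [c_i]      # (a_i1, ..., a_im, c_i)
        b_i = [sum(vec[r] * B[r][k] for r in range(size)) for k in range(size)]
        bs.append(b_i)
        # New matrix: b_i on top, the m-1 most recent older rows, unit vector e last.
        B = [b_i] + B[:m - 1] + [B[-1]]
    return bs


def solve_with_b(bs, x_init):
    """Each x_i = b_i * x_0^T is independent of the others once b_i is known."""
    x0 = list(x_init) + [1]
    return [sum(b * x for b, x in zip(b_i, x0)) for b_i in bs]


if __name__ == "__main__":
    a = [[2, 1], [1, 3], [0, 2], [4, 1], [1, 1]]
    c = [1, 0, 2, 1, 3]
    x_init = [1, 2]
    bs = parallel_coefficients(a, c, m=2)
    print(solve_with_b(bs, x_init) == row_sweep(a, c, x_init))
```

The last step is the part that parallelizes: once the coefficient vectors exist, every dot product in `solve_with_b` can be evaluated independently.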
Corollary 1

In any R<n,m> the computation of $\mathbf{b}_i = \mathbf{a}_i * B_{i-1}$, $1 \le i \le n$, requires at most $m(m+1)$ multiplications and $(m-1)(m+1) + 1 = m^2$ additions. Furthermore, the computation of $x_i = \mathbf{b}_i * \mathbf{x}_0^T$, $1 \le i \le n$, requires at most m multiplications and m additions.


However, $\mathbf{b}_1$ does not require any computation, $\mathbf{b}_2$ requires only m+1 multiplications and m additions, and so on. The first vector that requires $m(m+1)$ multiplications and $m^2$ additions is $\mathbf{b}_{m+1}$. All vectors after $\mathbf{b}_{m+1}$ (that is, any $\mathbf{b}_i$, $m+1 < i \le n$) require the same number of multiplications and additions. Therefore, when computing the total number of operations needed to compute all $\mathbf{b}_i$, $1 \le i \le n$, in any R<n,m>, two different formulas must be used: one for the case when $n \le m+1$ and the other for $n > m+1$.

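Corollary 2

In any R<n,m> the computation of all $\mathbf{b}_i = \mathbf{a}_i * B_{i-1}$, $1 \le i \le n$, requires no more than

$$
K_*(n) = \begin{cases}
\left(n - \tfrac{1}{2}(m+1)\right) m(m+1) & \text{if } n > m+1 \\[4pt]
\tfrac{1}{2}\, n(n-1)(m+1) & \text{if } n \le m+1
\end{cases} \tag{5}
$$

multiplications, and

$$
K_+(n) = \begin{cases}
\left(n - \tfrac{1}{2}(m+1)\right) m^2 & \text{if } n > m+1 \\[4pt]
\tfrac{1}{2}\, n(n-1)\, m & \text{if } n \le m+1
\end{cases} \tag{6}
$$

additions.

Furthermore, the total number of operations required is equal to

$$
K(n) = K_*(n) + K_+(n) = \begin{cases}
\left(n - \tfrac{m+1}{2}\right)(2m^2 + m) & \text{if } n > m+1 \\[4pt]
\tfrac{1}{2}\, n(n-1)(2m+1) & \text{if } n \le m+1
\end{cases} \tag{7}
$$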


Proof

For all i, $1 \le i \le m+1$, $B_{i-1}$ contains only i-1 nontrivial rows, so that $\mathbf{b}_i = \mathbf{a}_i * B_{i-1}$ requires only $(i-1)(m+1)$ multiplications. Therefore, for all $n \le m+1$

$$
K_*(n) = \sum_{i=1}^{n} (i-1)(m+1) = (m+1) \sum_{i=1}^{n} (i-1) = (m+1)\,\frac{n(n-1)}{2} \tag{8}
$$

On the other hand, for $n > m+1$, $K_*(n)$ consists of two parts. Using (8), the number of multiplications to compute all $\mathbf{b}_i$, $1 \le i \le m$, is equal to $\tfrac{1}{2}(m+1)\,m(m-1)$. Each consecutive $\mathbf{b}_i$, $m+1 \le i \le n$, requires $m(m+1)$ multiplications by Corollary 1. Therefore, for all $n > m+1$

$$
\begin{aligned}
K_*(n) &= (n-m)\, m(m+1) + \tfrac{1}{2}(m+1)\, m(m-1) \\
       &= n\, m(m+1) - m^2(m+1) + \tfrac{1}{2}(m+1)\, m(m-1) \\
       &= n\, m(m+1) - \tfrac{1}{2}\, m(m+1)^2
\end{aligned}
$$

Now consider $K_+(n)$ when $n \le m+1$. Each nontrivial row of $B_{i-1}$, not counting the first, requires m+1 additions and each trivial row only one addition under the assumption that the first row in $B_{i-1}$ is nontrivial. If the first row is trivial, then i = 1 and $B_{i-1} = B_0$ is an identity matrix. No additions are required in computation of $\mathbf{b}_1 = \mathbf{a}_1 * B_0 = \mathbf{a}_1$. Therefore, for all $n \le m+1$

$$
K_+(n) = \sum_{i=2}^{n} \left[(i-2)(m+1) + (m+1) - (i-1)\right] = \sum_{i=2}^{n} m(i-1) = \tfrac{1}{2}\, m\, n(n-1)
$$

Similarly, $K_+(n)$, $n > m+1$, consists of two parts since each $\mathbf{b}_i$, $m+1 \le i \le n$, requires $m^2$ additions by Corollary 1 and the computation of remaining $\mathbf{b}_i$, $1 \le i \le m$, altogether requires $\tfrac{1}{2}\, m^2 (m-1)$ additions. Thus

$$
K_+(n) = (n-m)\, m^2 + \tfrac{1}{2}\, m^2 (m-1) = n\, m^2 - \tfrac{1}{2}\, m^2 (m+1)
$$
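
The closed forms of Corollary 2 can be cross-checked against a direct per-vector count. The sketch below is illustrative only; the function names are hypothetical, and the per-vector counts follow the proof: the i-th vector costs (i-1)(m+1) multiplications and m(i-1) additions until i reaches m+1, after which the cost stays at m(m+1) and m^2.

```python
def mults_for_b(i, m):
    """Multiplications needed for the i-th coefficient vector."""
    return min(i - 1, m) * (m + 1)


def adds_for_b(i, m):
    """Additions needed for the i-th coefficient vector."""
    return min(i - 1, m) * m


def total_mults_closed(n, m):
    """Closed form (5), written with integer arithmetic."""
    if n > m + 1:
        return (2 * n - (m + 1)) * m * (m + 1) // 2
    return n * (n - 1) * (m + 1) // 2


def total_adds_closed(n, m):
    """Closed form (6), written with integer arithmetic."""
    if n > m + 1:
        return (2 * n - (m + 1)) * m * m // 2
    return n * (n - 1) * m // 2


if __name__ == "__main__":
    n, m = 9, 2
    print(sum(mults_for_b(i, m) for i in range(1, n + 1)), total_mults_closed(n, m))
    print(sum(adds_for_b(i, m) for i in range(1, n + 1)), total_adds_closed(n, m))
```

Both branches of each formula agree at n = m+1, which is why the case split can be placed on either side of that boundary.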


3. Pipelined-Processor System 


A pipelined-processor model consists of Main 
Memory (MM), Pipelined Processor (PP) and Control 
Unit (CU). PP may consist of several pipelined 
functional units. In our model, we shall assume 
only 2 functional units: multiplication pipe and 
addition pipe. Each of them has s stages. These 
two functional units are connected serially with 
the multiplication pipe feeding the add pipe 
through register R1. The results from the add 
pipe are either stored in MM or fed back into the 
add pipe through register R2 (Fig. 1). 


Algorithm 1 


Given a linear recurrence system R<n,m>, 


cpm, 
Execute the following algo- 


divide R<n,m> into [n/p] subsystems R 
where 1 < j < [n/pl. 


isc mei Ee | 


Algorithm 1 to the system R<28,2> is shown in 
Fig. 2. 


In general, a square with the inverse diagonal symbolizes computation of $\mathbf{b}_i^{(j)} = \mathbf{a}_i^{(j)} * B_{i-1}^{(j)}$, $1 \le i \le 4$, $1 \le j \le 7$. The input on the top of the square represents $\mathbf{a}_i^{(j)}$ and the output on the bottom $\mathbf{b}_i^{(j)}$. The horizontal input represents the matrix $B_{i-1}$, whose rows are vectors $\mathbf{b}_{i-1}, \mathbf{b}_{i-2}, \ldots$ and $\mathbf{e}$. The dots on the horizontal line indicate the vectors that are rows in $B_{i-1}$. The unit vector $\mathbf{e}$, which is always the last row of any matrix $B_i$, is not connected for clarity. The closest dot to the square represents the top row of $B_{i-1}$, that is, $\mathbf{b}_{i-1}$. The natural ordering of the rows in the matrix is established by proximity to the square. Similarly, the empty square represents the computation of $x_i^{(j)} = \mathbf{b}_i^{(j)} * \mathbf{x}_p^{(j-1)T}$ with vertical input being $\mathbf{b}_i^{(j)}$, vertical output $x_i^{(j)}$, and the horizontal input representing vector $\mathbf{x}_p^{(j-1)}$.
from MM and computing coefficients and variables 
in PP is indicated by dotted arrows. For ex- 


ample, the computation of bi, requires a,/, bi3 


and ey to be fetched from MM and entered into PP. 


It is impossible to proceed with computation of 


ays since bay will become available only after a 


certain number of time steps. Therefore, in 
order to keep pipelined functional units busy, 


Algorithm 1 proceeds by computing bi be» Ky» 
and b 


bop big: At that moment of time, 


bia should be available at the output of PP to be 


used in computation of b, 


The order of fetching 


5° The exact sequence 
of statements to be executed in this case is 


le given below: 
1. begin 
2s for i: = 0 until m-1 do ayo (0, -+50,x_4)3 
oe for i: =m until es ~ 1 do aw null; 
7 _ + 2 e — e 
4, for i: = 0 until p 1 do avait? null; 
5 for j: = 1 until [n/p] + p do 
6. begin 
7 By = identity matrix; 
ee fs , Gi-krl).. « ~<GSREL (j-k+1), 
8 for k: = 1 until p do be : = ay Be 3 
a , (J=P)) 24 CID) 2. p=), 
Os for k: 1 until p do ven Deer ae 4 
10. end 
ll. end 


The lines 2, 3, and 4 in Algorithm 1 are used 
to initialize boundary coefficients which are not 
included in R<n,m> , although they are referenced 
by the inner loop in lines 5-9. An application of 
Fig. 2. Solution of R<28,2> on a pipelined processor 

oe aes ae Pig 7 Steer 1g geo 
14,2 es Fes Pi3.o 182 = ae a a 
Cie 41444 443 a Sie:  Biger 19 eee: 
bay. = 242,12 10,1 + 212,2 P91 Ses Se a a es 
a ae ae Piper “Teh. ey ise 18.2 

, 147 °10,9°7 19,9 P92 us -, - . 
i Sc 15,1 14 pe ORS ne 
Bey. “Stig By 482 Seep bes PI 99 ae FAD, 16 
bey SS aay yy Pa By Pio A eo io 1 
do Baas ay tag, dg + ey 0 = ek tC 
2 Ss ea go ge Sy Mg Da Oy + Bao X3 bees 
Ky = bay Xp Pippa a: wal, oo Ua ty + dy» x, +d, 
Xo = bod Xo + boo xy + d, X¢ = bed x), + bes X, + de 
x4 = bi Xo + bio xy + d, Xs = bey xy + bey xX, + d. 


The next theorem is based on the observation that there are at most m+1 additions and m+1 multiplications per statement. Multiplication by 1 and addition to 0 are included. 
Theorem 2

Given an integer p > 0, any linear recurrence system R<n,m> can be solved in $T_s = \left(\frac{n}{p} + p\right)(m+2)$ time steps using pipelined functional units with $s = (m+2)p - (m+1)$ stages.


Proof 


The proof is based on Algorithm 1. The system R<n,m> is divided into subsystems $R^{(j)}\langle p,m\rangle$, $1 \le j \le \lceil n/p \rceil$. To keep pipelined units busy almost all the time p+1 subsystems are being solved concurrently; that is, different subsystems are using different stages of the pipelined units. Each $R^{(j)}\langle p,m\rangle$ system is solved in two parts: $\mathbf{b}_k^{(j)} = \mathbf{a}_k^{(j)} * B_{k-1}^{(j)}$, $1 \le k \le p$, is computed first and $x_k^{(j)} = \mathbf{b}_k^{(j)} * \mathbf{x}_p^{(j-1)T}$, $1 \le k \le p$, is evaluated afterwards. With p+1 systems computed concurrently, the order of computation is as follows 
cat Bh OO oc BIRR. go GHB) Gp) 
—l * -2 ae P p-1 
(i-p). 4 Gtl 4G) (j-pt2) _(j-ptl) 
oe PT, DIT, By wees BY » x, 
x (J-Ptl) 


 (J-Pt)) 
p-1 9 1 > 


oe 


> ee 


Therefore, at least (k-1) $\mathbf{b}_i$'s and k x's must be computed before their values are used in subsequent computations. Each $\mathbf{b}_i$ is a vector containing m+1 elements. Each element of $\mathbf{b}_i$ is computed as a sum-of-products (sop). Similarly, each x is a sop. Hence, there are $N = (m+1)(p-1) + p$ sop's altogether. Each sop has m+1 products at most. This implies that multiplication and addition pipes are used m+1 times for each sop. The bottleneck of the pipelined processor is obviously the addition pipe since a new product can be added to the corresponding partial sum only after one addition-pipe step, i.e., after s stage delays. Therefore, m+1 addition steps and one multiplication step are needed to obtain p solutions of R<n,m>. Total time needed to compute R<n,m> is less than $\left(\frac{n}{p} + p\right)(m+2)$ pipeline processor steps. The number of stages s in the addition pipe must be equal to N. This way, a partial sum for some sop and a new product to be added to that partial sum are generated at the same time by addition and multiplication pipes, respectively, and are ready to be entered into the addition pipe again.
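
The role of the condition s = N can be illustrated with a toy timing model. This is a simplified sketch under stated assumptions, not a reproduction of the timing in Fig. 3: it models only the addition pipe, assumes one add can be issued per clock, assumes a result is available exactly `stages` clocks after issue, and serves the sums-of-products in round-robin order. The function name and parameters are hypothetical.

```python
from collections import deque


def accumulate_time(num_sops, adds_per_sop, stages):
    """Clock at which the last partial sum leaves an s-stage addition pipe.

    Each sum-of-products needs `adds_per_sop` (>= 1) dependent additions;
    the next addition of a sop can only be issued once its previous
    partial sum has left the pipe.
    """
    # Queue entries: (sop id, clock at which its partial sum is available).
    queue = deque((j, 0) for j in range(num_sops))
    remaining = [adds_per_sop] * num_sops
    clock = 0
    finish = 0
    while queue:
        j, ready = queue[0]
        if ready <= clock:               # operand available: issue this add
            queue.popleft()
            remaining[j] -= 1
            done = clock + stages
            finish = max(finish, done)
            if remaining[j] > 0:
                queue.append((j, done))
        # Otherwise the pipe idles for this clock (a bubble).
        clock += 1
    return finish


if __name__ == "__main__":
    m, p = 2, 4
    n_sops = (m + 1) * (p - 1) + p       # N from the proof of Theorem 2
    for s in (4, n_sops, 2 * n_sops):
        print(s, accumulate_time(n_sops, m + 1, s))
```

In this model the pipe issues an add on every clock as long as the number of independent sums is at least the number of stages; with more stages than independent sums, bubbles appear and the total time grows with s instead of staying fixed.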


As an example, the pipelining of the above 
sequence of statements starting with computation 


of bi, through the pipelined processor model of 


Fig. 1 is given in Fig. 3. The memory output 
ports, MEM 1 and MEM 2, deliver two operands on 
each clock. An empty entry in Fig. 3 denotes the 


zero operand. The memory input port MEM 3 is 
independent of the two memory output ports. The 




critical point is the computation of d, which 


8 
is stored in clock 61 and fetched from memory in 
clock 62. 


It is worth noting the desirability of a large $p \le \sqrt{n}$ (the function $\left(\frac{n}{p} + p\right)(m+2)$ has a minimum at $p = \sqrt{n}$). However, a large p requires a large number of stages in pipelined functional units. Since the number of gate levels in an implementation of floating-point multiply or add is fixed, the number of stages s will hardly exceed 10-15. In present-day machines s is between 2 and 8. 
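
The trade-off between the time bound and the stage count of Theorem 2 can be tabulated directly. The sketch below only evaluates the two expressions from the theorem; the function names are hypothetical, and the search over integer p is an illustration rather than part of the paper.

```python
def stages_required(m, p):
    """Stages needed by Theorem 2: s = (m + 2) p - (m + 1)."""
    return (m + 2) * p - (m + 1)


def time_steps(n, m, p):
    """Time bound of Theorem 2: (n / p + p)(m + 2)."""
    return (n / p + p) * (m + 2)


def best_p(n, m):
    """Integer p in 1..n that minimizes the time bound."""
    return min(range(1, n + 1), key=lambda p: time_steps(n, m, p))


if __name__ == "__main__":
    n, m = 100, 2
    for p in (1, 2, 4, best_p(n, m)):
        print(p, stages_required(m, p), time_steps(n, m, p))
```

For n = 100 and m = 2 the minimum falls at p = 10, which would call for 37 stages, far beyond the 2 to 8 stages quoted for machines of the time; this is the tension the following corollary addresses.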


Secondly, increased s increases the cost of 
functional units since extra latches or registers 
have to be added which in turn introduces extra 
delay, reducing the effect of pipelining. 


The problem of finding optimal number of 
stages s for a given p was answered by Theorem 3. 
In reality, pipelined processors have fixed s and the problem is in finding minimal $T_s$. Without loss of generality, we shall assume that all functional units have the same number of stages. 


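Corollary 3

Given a pipelined processor with s-stage functional units, any R<n,m> can be computed in

$$
T_s = \min\left\{ \left(\frac{n}{p'} + p'\right)(m+2),\;\; \frac{(m+2)p'' - (m+1)}{s}\left(\frac{n}{p''} + p''\right)(m+2) \right\}
$$

where $p' = \left\lfloor \frac{s+m+1}{m+2} \right\rfloor$ and $p'' = \left\lceil \frac{s+m+1}{m+2} \right\rceil$.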


Proof 


For a given p, $N = (m+2)p - (m+1)$. If $s = N$, all stages are busy performing some computation. If $s > N$, exactly $s - N$ stages are idling and waiting for the partial sums to become available at the output of the addition pipe. The time $T_s$ required to compute R<n,m> is still the same, $\left(\frac{n}{p} + p\right)(m+2)$. However, the effectiveness 
E, = T/s has decreased. On the other hand, if 


$s < N$ there is no idling, but more pipeline time steps are required to compute R<n,m>. Consequently $T_s = \frac{N}{s}\left(\frac{n}{p} + p\right)(m+2)$. 


For a given s, it is possible to choose p such that $s \ge N$ ($p \le (s+m+1)/(m+2)$) or $s < N$ ($p > (s+m+1)/(m+2)$). In either case a p that minimizes $T_s$ is sought. In the former case, the function $\left(\frac{n}{p} + p\right)(m+2)$ is monotonically decreasing for positive p's, $p \le \sqrt{n}$, with the minimum at $p = \sqrt{n}$. Assuming $p \ll n$, the minimum time is obtained for the largest p such that $p \le (s+m+1)/(m+2)$. When $s < N$ the function $\frac{N}{s}\left(\frac{n}{p} + p\right)(m+2)$ is monotonically increasing for positive p's, and therefore the smallest p, $p \ge (s+m+1)/(m+2)$, requires minimum time to compute R<n,m>.

To obtain a qualitative assessment of Algorithm 1, we shall obtain the time needed to compute R<n,m> serially on a pipelined machine. A simple observation of $x_i = a_{i1} x_{i-1} + \ldots + a_{im} x_{i-m} + c_i$ shows that after $x_{i-1}$ has been computed, it should take theoretically at most 1 multiplication and 1 addition to compute $x_i$ since the remainder of the expression could be computed ahead of time. Therefore, as long as the number of stages s is smaller than m/2, the straightforward computation will keep the multiplication and addition pipes saturated and give the best possible speedup asymptotically.


Lemma 1

Any R<n,m> can be computed on a pipelined processor with s-stage functional units in time

$$
T' \le \frac{n}{s}\left(1 + \max(2s, m)\right) + \min(2s, m)
$$

using the "row sweep" algorithm.


Proof

The proof is based on the assumption that a number cannot be written into the memory and read out in the same clock period. First, assume $2s \le m$. Therefore, all variables are evaluated before they are used, that is, before they are input into the multiplication pipe. Counting a multiplication by 1 as well as addition to 0, the evaluation of each variable requires m+1 multiplications and m+1 additions. Therefore, the time to compute R<n,m> is $t_0 + n(m+1)/s$ where $t_0$ is the startup time. To calculate $t_0$, we observe that s variables are evaluated in m+1 time steps. However, the first variable requires m+3 steps and each of s-1 variables can be computed in 2 additional steps afterwards. Hence,

$$
t_0 = m + 3 + 2(s-1) - (m+1) = 2s
$$

The total computational time

$$
T' = \frac{n(m+1)}{s} + 2s \qquad \text{if } 2s \le m.
$$

Finally, assume $2s > m$. Since $m < 2s$, some pipeline stages are idle during computation and it takes at least 2 time steps plus one clock period to evaluate a new variable. The time needed to compute R<n,m> is equal to $t_0 + n(2s+1)/s$ where startup time

$$
t_0 = m + 3 + 2(s-1) - (2s+1) = m.
$$

Therefore,

$$
T' = \frac{n(2s+1)}{s} + m \qquad \text{if } 2s > m.
$$

Fig. 3. Pipeline diagram of R<28,2>
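
The two time bounds can be put side by side. The sketch below is an illustration under stated assumptions: the parallel bound follows the reading of Corollary 3 suggested by its proof (the basic time of Theorem 2, multiplied by N/s whenever s < N, evaluated at the two integer candidates for p), and the serial bound is the expression of Lemma 1. Function names are hypothetical.

```python
def parallel_time(n, m, s):
    """Bound for the parallel algorithm on s-stage units (Corollary 3 as read from its proof)."""
    p_lo = (s + m + 1) // (m + 2)            # largest p with s >= N
    p_hi = -(-(s + m + 1) // (m + 2))        # smallest p with p >= (s+m+1)/(m+2)

    def t(p):
        n_sops = (m + 2) * p - (m + 1)       # N for this p
        base = (n / p + p) * (m + 2)
        return base if s >= n_sops else base * n_sops / s

    return min(t(p_lo), t(p_hi))


def row_sweep_time(n, m, s):
    """Bound for the "row sweep" algorithm on s-stage units (Lemma 1)."""
    return n / s * (1 + max(2 * s, m)) + min(2 * s, m)


if __name__ == "__main__":
    n, s = 10**6, 15
    for m in (1, 2, 4, 5, 8):
        print(m, parallel_time(n, m, s), row_sweep_time(n, m, s))
```

Under these formulas, with s = 15 and large n the parallel bound is smaller for m = 4 and the "row sweep" bound is smaller for m = 5 and m = 8.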
Two algorithms presented in this paper are compared. For simplicity, the "row sweep" algorithm is denoted by $A$ and the parallel algorithm by $A_s$. Note that $A_s$ is the parallel algorithm requiring the shortest computational time (determined by Corollary 3) on our pipelined machine $M_s$, with s stages per arithmetic unit. $M_s$ is compared with a uniprocessor machine $M_1$ that executes any arithmetic operation in one step. The ideal computational rate of $M_s$ is 2s. For simplicity, we shall denote $T_{M_1}(A)$ and $T_{M_s}(A)$ by $T$ and $T'$, and $T_{M_s}(A_s)$ by $T_s$. Obviously, $T = 2mn$ for any R<n,m>, and $T/T_s = 2mn/T_s$. The speedups $(T/T_s)(1 + E_1)$ and $(T/T')(1 + E_2)$ for m = 1, 2, 4 and 8 are shown in Fig. 4 with $E_1 = \frac{p^2}{n}$ where $p \in \{p', p''\}$ and $E_2 = \frac{s}{n} \cdot \frac{\min(2s, m)}{1 + \max(2s, m)}$. The values of $E_1$ and $E_2$ are relatively small for large values of n.


The comparison of speedups in Fig. 4 shows that the parallel algorithms are superior for all $s > s_0$, and that "row sweep" is better for $1 \le s < s_0$. However, the cutoff point $s_0$ may be too large to be practical for banded triangular systems with medium and large bands. For example, if a practical s = 15 is assumed, then the "row sweep" algorithm has the advantage for all m > 4. 


Note that the "row sweep" algorithm has 
almost the best possible speedup of 2s for all 
s <m/2. For all s > m/2, the "row sweep" algo- 
rithm has approximately a constant speedup slowly 
approaching the value of m. This behavior is not 
surprising since the computation time T' is lim- 
ited at the beginning by the number of stages and 
later by the operation time, which is kept con- 
stant with respect to the number of stages per 
operation. 


The speedup of the parallel algorithms is much smaller than 2s. This can be explained by the increased redundancy of $A_s$ with respect to $A$. 


The redundancy function (Fig. 5) is a step func- 
tion with steps corresponding to the set of all 
positive integers; that is, the first step corre- 
sponding to the parallel algorithm with p = 1, the 
second step to p = 2, and so on. When p changes 
from, say, k to k+l, there is a substantial in- 
crease in the number of redundant operations which 
is compensated eventually by the increased number 
of stages s. 


The breakpoints in speedup functions in 
Fig. 4 are easily explained with redundancy func- 
tions. For each step there is an optimal number 
of stages s for which the speedup is maximal. 
These optimal values of s are indicated by heavy 
dots in Fig. 4. For any s between the two optimal values $s_1$ and $s_2$ corresponding to parallel algorithms $A_{s_1}$ and $A_{s_2}$, the algorithm with better speedup is used. As the number of stages increases from an optimal value $s_1$, the algorithm $A_{s_1}$ generates the same speedup since the number of operations stays the same and the operands do not flow faster through the pipelined arithmetic units. Since the number of stages is greater than $s_1$, some of the stages are idle, waiting for the operands to become available from the bottom of the pipe. The utilization of the machine decreases (Fig. 5). The algorithm $A_{s_2}$ is used when the increased number of stages is capable of compensating for increased redundancy of the algorithm. Since s is smaller than $s_2$, the number of operations that can be performed independently is greater than the number of stages. There are no idle stages and the speedup is limited to s. As s increases, the speedup increases linearly with s until it reaches its maximum at $s_2$. The utilization is constant in intervals of the speedup's linear increase. 


Another interesting result is a lower-than- 
expected speedup for banded systems with m = 1. 
Although the redundancy for m = 1 is low, the 
utilization is low, too, resulting in overall 
speedup that is below those for m = 2 and m= 4. 
The low utilization is the result of a very 
simple model which requires a multiplication by 1 
as well as addition to 0 to be considered as 
operations, while they were not counted as oper- 
ations in computation of redundancy. Since the 
percentage of these operations is high for m= 1, 
the utilization is very low. 


4. Conclusion 


A parallel algorithm for solving banded tri- 
angular systems on pipelined machines was de- 
veloped. The algorithm uses extra redundant 
operations to allow parallel computation in pipe- 
lined arithmetic units with s stages. The number 
of s stages is large enough to compensate for 
introduced redundancy and to achieve overall 
speedup with respect to a uniprocessor machine 
using the natural "row sweep" algorithm. When 
the "row sweep" algorithm is microprogrammed on 
the pipelined machine, the comparison shows that 
the parallel algorithm is superior whenever the 
number of stages s is greater than $s_0$. The break-even point $s_0$ depends on the size of the band m, and increases with m. Even for medium m 
the break-even point may be too large to be 
practically implementable, since floating-point 
operations have a fixed number of gates and, 
therefore, they can be pipelined only to a cer- 
tain number of stages. 


The final comparison that remains to be made is between pipelined and parallel machines. In other words, what machine organization should we adopt if the only measure of performance is banded triangular system solvers? The pipelined machine with s stages per operation is compared to a SIMD parallel machine. Such a parallel model has p = s memory modules and p processors (arithmetic units). Similarly to the pipelined arithmetic units, each processor requires one time step to generate the result from two operands for any arithmetic operation. However, the arithmetic is not pipelined and, therefore, only one result is generated in each time step. It is assumed further that no time is required to communicate data between processors and memories, and that the storage arrangement of data is irrelevant. 


The algorithm for a parallel machine was 
developed in [4] and the time necessary to com- 
pute a banded triangular system of size m was 
established to be 




era + 3m) - me Cinta) if p > m1: 
7, =/ P 2p | 
Pp 
1 1 1 
= sy Gee i < 
n(m + Nes a >) if p < ml 


As before, two algorithms, the "row sweep" algorithm and the parallel algorithm running on the pipeline machine, are compared to a parallel algorithm on a parallel machine.


The speedups $(T_p/T_s)(1 + E_1)$ and $(T_p/T')(1 + E_2)$ are plotted in Fig. 6. The pipeline machine with the "row sweep" algorithm is inferior to the parallel machine for m = 1 and m = 2. For all m > 2, the parallel machine eventually succeeds in outperforming the pipelined machine, when the number of processors is large enough. When the parallel algorithm is used on the pipelined machine, the situation is quite different. For m = 1 the parallel machine is 2 times faster than its pipelined counterpart, while for m = 2 they are approximately equal. For all m > 2, the pipelined model is performing better, but no more than 2 times better.


Fig. 6. Speed comparison of pipeline and parallel models (horizontal axis: number of pipeline stages or number of parallel processors; curves: "row sweep" algorithm and parallel algorithm)


It should be mentioned that the results are 
not very surprising since the ideal computational 
rate of the pipelined model is proportional to 2s 
while that of the parallel model is only pro- 
portional to p = s. In the pipelined model, we 
allowed two arithmetic operations (a multiplica- 
tion and an addition) to be executed at the same 
time, while only one operation per time step was 
allowed in each processor of the parallel model. 
A PARALLEL/PIPELINE MULTIPROCESSOR ARCHITECTURE FOR 
SOLVING SYSTEMS OF LINEAR EQUATIONS 


William C. Liles and James W. Demmel 
Santa Monica, California 


Summary 


Methods have been explored for solving large 
systems of simultaneous linear equations with positive 
definite Hermitian coefficient matrices. Our goal 
was to solve large systems quickly and efficiently, 
using a parallel/pipeline approach. We summarize 
here an algorithm based on Gram-Schmidt orthogonal- 
ization and its implementation on a simply and 
regularly connected processor array with simple 
processors. Given an n×n matrix A and m vectors $B_i$, $1 \le i \le m$, our implementation can solve the m problems $AX_i = B_i$, $1 \le i \le m$, in time O(m + n), using approximately $n^2/2$ processors. This method may be used in least squares, regression, and adaptive signal processing problems, and variational problems such as solving self-adjoint differential equations with finite element methods. 


Let the system to be solved be AX = B, with A a positive definite Hermitian n×n matrix, B an n×1 vector, and X an n×1 vector of unknowns. If T is any nonsingular n×n matrix, then $A_T = TAT^*$ ($T^*$ = T conjugate transpose) is also positive definite Hermitian and $X = A^{-1}B = T^* A_T^{-1} T B$. By choosing T so that $T$, $A_T^{-1}$ and $T^*$ are easily computable, we will have simplified our problem. This computation will be accomplished by the Gram-Schmidt orthogonalization process [1] (where <a, b> is a complex inner product): 


zi = B, = input 
J J 
i+l1 i i i i i 1 
k 7. = me >/< >) : 
Z, = Z; = output (Z = TB) 


The T provided by (*) is unit upper triangular and $A_T = \mathrm{diag}(\langle Z_i, Z_i \rangle)$ is diagonal. In fact, the decomposition $A = T^{-1} A_T (T^*)^{-1}$ is just the Cholesky decomposition of A. 


The implementation of T is shown in Figure 1 


for n=4 (arrows indicate direction of data flow). 
Processor ae computes 
+] ° ‘ ‘ ’ fi ‘ 
ZS Ze Swe Zw & eZ ZB RZ, Z>) 
j 0 ij i - ij 5 i 2 eae: 3 


The n(n - 1)/2 processors work as follows: The 


processors in row i work simultaneously on 
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90403 


intermediate results passed to them by the row 
above and pass their results simultaneously on to the 
next row, P. .,,, broadcasting its value to the 

. n Aeegt Co lB j 
entire next row. All processors work simultaneously, each row working on the intermediate 
results of different input vectors being piped into 
the top of the processor array. The processors 
are simple, identical, independent of n, and are 
connected in a simple pattern with fixed unidirec- 
tional data paths. 


Figure 1. 


Implementation of T 


The w.,'S may be computed in two ways by the 
array in Figure 1, depending on whether the actual 
matrix A is known or if A must be computed as a 
sample covariance matrix of many sample vectors 
(typical least squares problem). Ana 4 is available 
in any processor row i. ‘ 


Multiplication by T* may be done by either 
the unit vector method or the reverse flow method. 
The unit vector method computes the entries of 
T* by passing standard unit vectors through the 
array and forming the product in vertically 
connected processors on the output ports of the 
array. This method solves the m problems AX, = B, 
in time O(mn) with high efficiency E = Processor 
on time/ (Total time *, Total processg?8) = 
(m/m +1) © {[(m+ 1)” - 2]/(. +1)°}. The reverse 
flow method passes the data backwards through the 
array, performing the computations in reverse. 
This method requires time O(m + n) but has lower efficiency $E_{RF} = m/[m + 2(n + 1)]$. 
Summary

This paper presents pipelined multiplying arrays, which use macrocells bigger than the gated one bit full adder (GFA) discussed previously in the literature [1]. The introduction of these macrocells decreases the number of latching stages, and, in some cases, increases the gate number of the combinatorial part. Thus the total gate number of the resulting structure must be evaluated for every particular implementation of the cell. On the other hand the row and column number of the array are reduced, because each macrocell performs the same calculation for a larger number of bits than the GFA. Therefore the latency time and throughput are remarkably improved if the delay per stage, D, is suitably limited; this can be obtained by implementing the macrocell with a few gate levels. In this paper the cost/operation parameter, as defined in [2], was selected in order to compare our proposed arrays with the classical ones. Since the two logic level implementation of a 2x2 bit full multiplier macrocell requires 119 gates, taking into account the results presented in another paper [3], it may be seen that the 48x48 bit maximally pipelined multiplying array, built with these macrocells, has a better cost/operation than the equivalent Guild array, provided that the number of consecutive multiplications, M, is less than 209. Thus, in this case, the solution proposed is better than the equivalent GFA array, in almost all practical situations. 


To reduce the complexity of the array a second 2x2 bit full multiplier macrocell, shown in Fig. 1, is introduced. The resulting structure performing 4x4 bit multiplication is shown in Fig. 2; the cells enclosed by the dashed line are needed to reduce the output result to a binary number having only one digit for every weight. The two gate level implementation of the two types of cell requires 82 and 14 gates respectively, so that every characteristic of the above mentioned 48x48 bit array is better than the corresponding Guild array [1]. We may define a gain parameter, $\gamma$, as follows:

$$
\gamma(M) = 1 - \left[\frac{\eta_M(M)}{\eta_G(M)}\right]
$$

where $\eta_M(M)$ is the cost/operation of the maximally pipelined macrocellular array performing the 48x48 bit multiplication, and $\eta_G(M)$ is the cost/operation of the equivalent maximally pipelined Guild array. Fig. 3 shows the behavior of $\gamma$ vs. M. This curve shows that the macrocellular array improves the cost/operation of the analogous Guild array, for every number of consecutive multiplications. These results confirm that the introduction of macrocells is a new and satisfactory way of implementing pipelined multiplying arrays.


Fig. 1. Second type of macrocell: (a) logic symbol; (b) dot representation of the arithmetic function.

Fig. 3. Cost/operation improvement obtained using macrocells in a 48x48 pipelined array (horizontal axis: M).
MODELING MAXIMUM PARALLEL EXECUTIONS IN 
PIPELINE EXECUTABLE FORM USING PRECEDENCE EXPRESSIONS 


B. I. Dervisoglu 
Department of Electrical Engineering 
and Computer Science 
Storrs, CT 06268 


Summary 


A procedure is described [2] which transforms a computation structure [1] into an execution model in "maximum parallel" form that also allows several executions to be pipelined through the same structure. The procedure produces a set of expressions which are evaluated continuously by a controller to regulate the flow of control through the computation. A computation is initially represented as a sequence of steps, called a program. Each step identifies the source and destination variables for the data values and may contain branching information. Data operations are represented by some unidentified function F. Start and exit steps identify the input and output variables of the program. A sample program is shown below. 
s: start (A,B)         h: go to (Y)i,j 
a: X=F(A)              i: Y=F(Y) 
b: T=F(A)                 go to k 
c: A=F(B)              j: C=F(B) 
d: Y=F(T)              k: Z=F(Y) 
e: go to (Y)f,g        l: Z=F(Z,C) 
f: T=F(A)              m: W=F(B) 
   go to d             t: exit(Z,W) 
g: C,Y=F(A,X) 


The flow of control through the program can 


expressed by a control flow expression 
Gesabiedé tae -a(h iii 4) eda 


be 


where "+" means OR and superscripts are used to 
express decision outcome at a k-way branch step. 
Each loop is represented by listing its steps 
once. The computation will remain valid as long 
as the execution of each step is preceded by all 
previous steps which write data into the source 
variable(s) and/or read data from the destination 
variable(s) of that step. An execution is in 
maximum parallel form iff the steps of the pro- 
gram can be scheduled for execution immediately 
when the above precedence rule is satisfied. To 
achieve maximum parallel executions a precedence 
expression is evaluated for each step. The 
precedence expression for step c is the Boolean 
product term P(c)=s.a.b, which is obtained by 
examining the control flow expression starting with 
c and going to the left. Repeated steps are treated 
in a similar fashion, searching towards the left 
until either the beginning of the control flow 
expression or another occurrence of that step is 
found. Product terms so obtained are OR'ed 
together, e.g., P(d)=b+e^0.f. If a loop appears to 
the left of some step q which must be preceded by 
a step within the loop then q must be preceded 
by the entire loop. Thus P(g)=c.e^1 since g must 
be preceded by d which is within a loop and e^1 
represents the exit from that loop. Some of the 
other precedence expressions are P(h)=d.e^1.g and 
P(k)=d.e^1.g.(h^0.i+h^1). These expressions can be 
simplified using Boolean algebra and the transitive 
property of the precedence relationship. 
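
The precedence rule stated above can be illustrated with a minimal Python sketch. It covers only the straight-line prefix s, a, b, c, d of the sample program and ignores branches, loops and Boolean simplification; the function name `precedence` and the tuple layout are assumptions made for this illustration, not part of the original method.

```python
def precedence(steps):
    """Return, for each step, the earlier steps that must precede it.

    steps: list of (label, destination_vars, source_vars) in program order.
    A previous step precedes the current one if it writes one of the current
    step's source variables or reads one of its destination variables.
    """
    result = {}
    for idx, (label, dests, sources) in enumerate(steps):
        preds = []
        for prev_label, prev_dests, prev_sources in steps[:idx]:
            writes_source = set(prev_dests) & set(sources)
            reads_dest = set(prev_sources) & set(dests)
            if writes_source or reads_dest:
                preds.append(prev_label)
        result[label] = preds
    return result


# Straight-line prefix of the sample program (branch steps omitted).
straight_line = [
    ("s", ["A", "B"], []),   # start (A,B)
    ("a", ["X"], ["A"]),     # X=F(A)
    ("b", ["T"], ["A"]),     # T=F(A)
    ("c", ["A"], ["B"]),     # A=F(B)
    ("d", ["Y"], ["T"]),     # Y=F(T)
]

if __name__ == "__main__":
    for step, preds in precedence(straight_line).items():
        print(step, ".".join(preds))
```

For step c this yields the product s.a.b, in agreement with P(c) in the text; for step d it yields only the term b, since the second term of P(d) comes from the loop, which this sketch does not model.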


The control flow can be implemented using 
operators which, when "fired", enable execution of a 
program step and upon its completion set their own 
output to logic 1. An operator can be reset when 
the outputs of the operators it precedes become 1. 
An operator which is outside a loop but precedes 
another operator within the loop is reset after 
the loop terminates. These conditions can be 
expressed using reset expressions. For example, 
R(s)=a.b.m since the start step s precedes a, b and m, 
as seen from the simplified precedence expressions 
for these steps. Also R(e)=f+g and R(h)=i.l+j.k, 
which are obtained by taking the sum of the terms 
obtained for each decision outcome of these k-way 
branch steps. 
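
For steps without branches, a reset expression is simply the product of the steps that list the operator in their simplified precedence expressions. The sketch below inverts such a precedence table in Python; the function name `reset_terms` is hypothetical, and the table contains only the entries P(a)=P(b)=P(m)=s, which are assumed here from the statement that s precedes a, b and m. Branch steps, which need one product term per decision outcome, are not handled.

```python
def reset_terms(precedence):
    """Invert a precedence table: map each step to the steps it precedes.

    precedence: dict mapping a step label to the list of labels in its
    (AND-only, already simplified) precedence expression.
    """
    resets = {step: [] for step in precedence}
    for step, preds in precedence.items():
        for pred in preds:
            resets.setdefault(pred, []).append(step)
    return resets


# Assumed simplified precedence expressions for a, b and m.
simplified = {"s": [], "a": ["s"], "b": ["s"], "m": ["s"]}

if __name__ == "__main__":
    print("R(s) =", ".".join(reset_terms(simplified)["s"]))
```

Only the entry for s is meaningful in this small table; it reproduces R(s)=a.b.m.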


For pipelined execution through the program 
it is necessary to delay resetting the operators 
until all later steps that read data values from 
the destination variables of that step have been 
executed. Such cases can be detected by examining the 
control flow expression from left to right. The 
resulting delay conditions can be expressed as 
delay expressions. For example, the delay 
expression for step g is D(g)=(h^0.i+h^1).k. Finally, the 
overall reset conditions are given by T(q)= 
R(q).D(q), which can be simplified in the usual 
manner. Some of the final reset expressions for 
the sample program are T(s)=(h^0+j).m, T(b)=a.c.e, 
T(g)=(i+j).k. 


The control operators should be implemented 
as bi-stable operators where the operator is set 
(reset) when its P(q) (T(q)) has value 1 and all 
literals in its T(q) (P(q)) have value 0. The 
particular order in which these expressions are 
evaluated does not alter the outcome of the com- 
putation. Steps scheduled for execution are 
placed into resource queues which are dynamically 
modified according to some priority criteria. 
Setting the output of a control operator to 1 
after the step is executed makes it appear as if 
its precedence expression has been evaluated just 
then. This leads to dynamic resource allocation. 
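
The set and reset rule for a bi-stable operator can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the class name and its parameters are hypothetical, step execution is not modelled (the output changes as soon as the rule holds), and the example uses T(b)=a.c.e from the text together with P(b)=s, which is assumed here by applying the precedence rule to step b.

```python
class BistableOperator:
    """Two-state control operator for one program step (illustrative only)."""

    def __init__(self, p_expr, p_literals, t_expr, t_literals):
        self.p_expr = p_expr          # function: outputs -> truth value of P(q)
        self.p_literals = p_literals  # labels appearing in P(q)
        self.t_expr = t_expr          # function: outputs -> truth value of T(q)
        self.t_literals = t_literals  # labels appearing in T(q)
        self.output = 0

    def update(self, outputs):
        """Apply the set/reset rule to the current operator outputs."""
        if self.p_expr(outputs) and not any(outputs[x] for x in self.t_literals):
            self.output = 1   # set: P(q)=1 and every literal of T(q) is 0
        elif self.t_expr(outputs) and not any(outputs[x] for x in self.p_literals):
            self.output = 0   # reset: T(q)=1 and every literal of P(q) is 0
        return self.output    # otherwise the previous state is held


# Operator for step b: assumed P(b)=s, and T(b)=a.c.e as given in the text.
op_b = BistableOperator(
    p_expr=lambda o: o["s"],
    p_literals=["s"],
    t_expr=lambda o: o["a"] and o["c"] and o["e"],
    t_literals=["a", "c", "e"],
)

if __name__ == "__main__":
    print(op_b.update({"s": 1, "a": 0, "c": 0, "e": 0}))  # set
    print(op_b.update({"s": 1, "a": 1, "c": 1, "e": 1}))  # held
    print(op_b.update({"s": 0, "a": 1, "c": 1, "e": 1}))  # reset
```

Because the set condition requires all literals of T(q) to be 0 and the reset condition requires all literals of P(q) to be 0, the two rules cannot both apply at once, which is why the order of evaluation does not matter in this sketch.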
SUMMARY 


A progressive specification method is proposed, 
since this appears to be the only method currently 
available for designing a particular software 
product (a set of programs implementing a given 
application on a particular system). Our aim is to 
obtain the final product (software, hardware) by a 
series of specification refinements. Each refinement 
consists in a transformation from one specification 
to a more detailed or more precise specification. 
A refinement may be considered faultless if it does 
not introduce any design error. To check a 
specification refinement, a mathematical tool or a 
simulator (the latter giving only partial proof) 
can be used. 


We shall define the initial specifications S0 as 
the system representation or model that the designer 
obtains from the basic information on hand : this 
model is chosen by the designer. In our practical 
experience, the initial specifications are obtained 
in three steps : data graph -> data graph labelled 
by primitives of data acquisition -> control monitor, 
graphically represented by an extension of Petri nets 
[PETERSON]. 

The specifications of a certain level are explained 
in greater detail by the addition of supplementary 
information on the system under design, e.g. : 

- Temporal information : the designer specifies the 
implementation time of certain functions ; this 
means that hardware performance assumptions have 
been made. 

- Architecture : type and number of microprocessors, 
communications... 

- Choice of software : performance of a compiler or 
interpreter. 

The design method consists in making successive 
reliable refinements from initial specifications. The 
process of checking that the refining process has 
not introduced any design error will be called 
"validation" and will be undertaken at every stage 
of refinement. A design error is a violation of the 
functional or operational conditions. Validation 
will generally be achieved by simulation and, in 
special cases, by mathematical analysis. 


The first level of validation is called V0 : 

On the model described above, it is initially 
assumed that the hardware resources are unlimited and 
of infinite power. This means that the function 
computing time is almost zero. Therefore, the 
initial aim of the designer is to find errors that 
are independent of any implementation. It is worth 
noting that the known or suspected "error" 
situations are defined by the designer. In our approach, 
the following situations were detected by a 
simulator : 

- evolution cycle of infinite duration, 
- locking, 
- final state of an evolution cycle depending on 
the order in which transitions are scanned. 

Also coming into this category are the data fre- 
quency incoherencies which may be detected by ana- 
lysis or by simulation. 
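
As an illustration of how a simulator might detect locking at level V0, the following Python sketch explores the reachable markings of an ordinary Petri net and reports those in which no transition is enabled. It is a sketch under assumptions: "locking" is taken here to mean such a dead marking, the function names are hypothetical, and the example net (two tasks taking two shared resources in opposite order) is invented for the illustration and does not come from the application described.

```python
from collections import deque


def enabled(marking, transition):
    """A transition is enabled if every input place holds enough tokens."""
    inputs, _ = transition
    return all(marking.get(p, 0) >= n for p, n in inputs.items())


def fire(marking, transition):
    """Return the marking obtained by firing an enabled transition."""
    inputs, outputs = transition
    new = dict(marking)
    for p, n in inputs.items():
        new[p] = new.get(p, 0) - n
    for p, n in outputs.items():
        new[p] = new.get(p, 0) + n
    return new


def find_locked_markings(initial, transitions, limit=10000):
    """Breadth-first search for reachable markings with no enabled transition."""
    seen, locked = set(), []
    queue = deque([initial])
    while queue and len(seen) < limit:
        marking = queue.popleft()
        key = frozenset((p, n) for p, n in marking.items() if n)
        if key in seen:
            continue
        seen.add(key)
        fireable = [t for t in transitions.values() if enabled(marking, t)]
        if not fireable:
            locked.append(dict(key))
        for t in fireable:
            queue.append(fire(marking, t))
    return locked


# Hypothetical net: task A takes r1 then r2, task B takes r2 then r1.
transitions = {
    "a_take1": ({"a0": 1, "r1": 1}, {"a1": 1}),
    "a_take2": ({"a1": 1, "r2": 1}, {"a2": 1}),
    "a_release": ({"a2": 1}, {"a0": 1, "r1": 1, "r2": 1}),
    "b_take2": ({"b0": 1, "r2": 1}, {"b1": 1}),
    "b_take1": ({"b1": 1, "r1": 1}, {"b2": 1}),
    "b_release": ({"b2": 1}, {"b0": 1, "r1": 1, "r2": 1}),
}
initial = {"a0": 1, "b0": 1, "r1": 1, "r2": 1}

if __name__ == "__main__":
    print(find_locked_markings(initial, transitions))
```

In this net the only locked marking is the one where each task holds one resource and waits for the other. The `limit` parameter bounds the search, so for nets with very large or unbounded state spaces the result is only a partial check, in line with the remark that simulation gives partial proof.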


At the next level of specification S1 and validation 
V1, a performance time is associated with 
each function, this time being correlated to the 
choice of a type of microprocessor, and the system 
is represented by a time-dependent model. The va- 
lidation can, for example, detect the following : 
a) Work loads incompatible with the system : con- 
sequently, the designer will have to consider ano- 
ther type of processor, or spread this task over 
several processors, or again use several processors 
in parallel. After modification, the validation 
process (simulation) is repeated. This validation 
can be also carried out algebraically. 

b) Observing response times : in addition to simu- 
lation, the validation can again be of the analyti- 
cal type. 

c) Data incoherency : in real time systems, certain 
output functions depend on external inputs 
with the following restriction : these inputs 
must be coherent, i.e. sampled at the same time 
(same copy). For example, the data sampled at a 
given moment could all be of the same colour ; the 
validation condition (simulation) consists in 
checking that a function consumes only tokens of the 
same colour. 
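
The colour condition in c) can be expressed as a very small check. In this Python sketch a token is assumed to be a pair (input name, colour), with the colour standing for the sampling instant; the function name and the token representation are hypothetical and serve only to illustrate the coherency test.

```python
def consumes_coherent_tokens(tokens):
    """Return True if all tokens consumed by one function firing share a colour.

    tokens: iterable of (input_name, colour) pairs.
    """
    colours = {colour for _, colour in tokens}
    return len(colours) <= 1


if __name__ == "__main__":
    # Both inputs sampled at instant 7: coherent.
    print(consumes_coherent_tokens([("speed", 7), ("altitude", 7)]))
    # Inputs sampled at different instants: incoherent.
    print(consumes_coherent_tokens([("speed", 7), ("altitude", 8)]))
```

A simulator could apply such a check each time a function consumes its input tokens and report a data incoherency when it fails.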


At the next level of specification S2 and validation 
V2, functions associated with places are defined 
by their input variables, their output variables 
and work on local variables. The main validation 
consists in detecting functions and/or 
predicates that are calculated in parallel and share 
the same variables. The designer may consider that 
this is a design error (tasks should only work on 
local variables) ; the designer may also break up 
the function and specify (explain) the 
synchronisation (priority) selected. 


The final level of specification S3 consists in 
the choice of an architecture : this is a fundamental 
step in the refinement process since the selected 
architecture has to be taken into account 
accurately and the corresponding parameters have 
to be entered into the model. We shall consider 
two fundamental choices of architecture (one 
solution may combine these two types of choice). 


Architecture with specific processor : A physical 
resource (processor) performs either the monitoring 
operation (control monitor) or the functions 
of the system. In this case, an interpreter for 
this type of specifications (MAS system [3]) has 
been implemented on a multimicroprocessor system 
"4M" [5] : one of the four microprocessors 
is specifically allocated to the execution of the 
control monitor evolution. 


Decentralized architecture : Some of the physical 
processors perform both control tasks and functions. 
The initial net is, for instance, partitioned 
into connected parts ; every sub-net is associated 
with a physical processor which performs both the 
control evolution and the function computation. 
This means that a control monitor is stored in the 
RAM of each microprocessor. This approach was 
applied industrially to an aircraft system ; the 
resulting network was made up of four microcomputers 
linked by private communications. 

The validation V3, by means of simulation, consists 
in checking that the anomalies checked for in the 
previous steps do not appear in this more precise 
description. In addition to the functional 
anomalies (locking, conflict), we check that the 
operational constraints (response times, for instance) 
are still met when resource restrictions are 
introduced ; for instance, an upper bound on the task 
queue can be computed in the second type of 
architecture, and we verify that the system satisfies 
these constraints. 


The proposed approach is a top-down one which, in 
our view, minimizes the likelihood of design 
errors. It is obtained by using a general method 
of successive refinements from initial 
specifications to final ones. 